Foreword

This is the third in a series of reports leading to the development of a
limited continuous speech recognition (LCSR) capability for isolated word
recognition (IWR) hardware. This is the first application of the algorithms
to a specific, real-world vocabulary. A combination of IWR and LCSR was
attempted on the same input phrase in some instances. The resulting system
was low in user acceptance. Increased sophistication will be required for
further application.

Thanks  are  extended  to  the  command  and  staff  of  the  Fleet  Combat 
developed herein.

NAVTRAEQUIPCEN 78-C-0044-1
SECTION  I 
INTRODUCTION 


SCOPE 


This document is the final technical report for the work performed
under contract N61339-78-C-0044: Limited Continuous Speech Recognition
(LCSR) for Air Intercept Controller (AIC) Training. The purpose of the
study was to identify operational training requirements and limitations of
AIC training, identify AIC vocabulary requirements for a dynamic interactive
training environment, and develop experimental subsystems for low-cost
implementation of an LCSR capability for AIC training. The document is
intended for use by the Naval Training Equipment Center's Human Factors
Laboratory and other interested parties in support of the definition,
specification, and design of an experimental prototype training system for
the Air Intercept Controller.

RELATED  DOCUMENTS 

The  following  documents  describe  work  which  is  related  to  the  efforts 
discussed  herein: 

a. Use of Computer Speech Understanding in Training: A Demonstration
Training System for the Ground Controlled Approach Controller; Technical
Report NAVTRAEQUIPCEN 74-C-0048-1; Logicon, Inc., July, 1974.

b. Use of Computer Speech Understanding in Training: A Preliminary
Investigation of a Limited Continuous Speech Recognition Capability;
Technical Report NAVTRAEQUIPCEN 74-C-0048-2; Logicon, Inc., June, 1977.

c. LISTEN: A System for Recognizing Connected Speech Over Small,
Fixed Vocabularies, in Real Time; Technical Report NAVTRAEQUIPCEN
77-C-0096-1; Logicon, Inc., April, 1978.

d. Air Intercept Controller Training: A Preliminary Review; Technical
Report NAVTRAEQUIPCEN 77-M-1058-1; Logicon, Inc., June, 1977.

e. NAVTRAEQUIPCEN Specification N-71-277: Specification for Study,
Limited Continuous Speech Recognition for Air Intercept Controller
Training.

f. A Laboratory Investigation of Requirements for Air Intercept
Controller Training; Technical Report NAVTRAEQUIPCEN 7S-C-M53-1; Logicon,
Inc.; in press.

DOCUMENT ORGANIZATION

Following these introductory remarks, the report discusses in Section
II the background against which this study was conducted, and formulates
more precisely the problems addressed in the study. Section III describes
the AIC vocabulary addressed by the effort. Section IV describes the
programs developed for establishing voice reference data, and Section V
describes the actual recognition software. The document concludes (Section
VI) with a discussion of the results, conclusions and recommendations
derived from this study.

SECTION II

BACKGROUND  AND  TASK  DEFINITION 


BACKGROUND 

During calendar year 1977, the Naval Training Equipment Center
sponsored a research and development effort directed toward establishing a
real-time, moderate cost system which would recognize connected digits
spoken by a speaker for whom the algorithm and data base had been
specifically tailored. That R&D activity has just recently been continued.
Nevertheless, the initial successes of the adopted approach, together with
the need for this capability in an automated training system for Air
Intercept Controllers, prompted the government to initiate an (admittedly
high risk) study aimed at developing a speech understanding subsystem (SUS)
utilizing the new recognition techniques. This SUS would then be tested in a
laboratory AIC training model concurrently being developed.

OBJECTIVES 

The  objectives  of  this  activity  have  been  threefold: 

a.  Achieve  a greater  understanding  of  the  demands  placed  on  computer- 
based  speech  recognition  by  an  automated  AIC  training  system. 

b. Determine if the previously demonstrated connected digit recognition
system could be effectively combined with the more usual isolated
word recognition (IWR) systems to satisfy the AIC requirements.

c.  Provide  an  applications  environment  for  the  continued  development 
of  the  recognition  algorithms  to  provide  focus  and  ensure  the  earliest 
possible  realization  of  an  operationally  useful  recognition  capability. 

NATURE  OF  THE  RISKS 

The objectives for this study outlined in the previous subsection
represent problem areas relative to the constraints imposed by near-term
speech technology and training contexts for the following reasons:

a. An automated AIC training system featuring objective performance
measurement, concept sequencing, and adaptive syllabus control does not
exist - in any form - today. While a speech-based automated Ground
Controlled Approach (GCA) controller training system has been investigated,
the automated AIC training problem represents a significant advance in both
the application of the speech technologies and training systems design.

b. Automatic recognition of AIC trainee speech presents a formidable
challenge. Unlike the GCA controller, the AIC speaks in a relatively
free-flowing format with significant information often embedded
within the transmissions. However, the unnatural speech stylization
required to make IWR work with the AIC vocabulary is disturbing to the
trainee and degrades training effectiveness. Therefore IWR, by itself,
would not be an acceptable approach toward satisfying the automatic speech
recognition requirements of an AIC training system. A complete LCSR
capability is beyond the near-term forecast for the computer based
recognition technology. Consequently, a mixed IWR and LCSR system is the
only reasonable approach; but it, too, must be examined and described.

c. The demonstrated digit LCSR capability was coded in a stand-alone
environment in which the computational resources of the minicomputer were
not shared with any competing functional tasks. Moreover, the demonstration
was entirely context free, simply outputting recognized digits to a CRT for
visual display. The design and implementation of an IWR/LCSR subsystem for
incorporation into an AIC training system remains a challenging task.

d.  The  problems  associated  with  automating  the  collection  of  voice 
reference  data  for  the  LCSR  algorithms  are  significant.  The  LCSR  capability 
demonstrated  earlier  was  strictly  speaker  dependent;  that  is,  long  and 
arduous  data  collection  and  processing  was  required  for  each  speaker.  This 
method  is,  of  course,  far  from  operationally  acceptable. 

APPROACH 

The technical approach taken to satisfy the objectives of this study
was:

a. Define the functional AIC vocabulary associated with the learning
objectives being addressed in the AIC laboratory model.

b. Establish reasonable stylization constraints and thereafter
delineate the recognition lexicon to be addressed by the study.

c. Design a speech understanding system which could recognize the
chosen AIC vocabulary with the minimum stylization constraints consistent
with the developed technologies.

d. Implement the computer programs required to establish voice
reference data and recognize the selected vocabulary.

e.  Test  the  speech  understanding  subsystem  in  a laboratory  AIC 
training-like  environment. 

f. Make recommendations for subsequent recognition capabilities in the
experimental prototype AIC training system.

SECTION III
THE  AIC  VOCABULARY 


THE  SCENARIOS 

Study of the AIC vocabulary centered around three scenarios which were
carefully chosen to represent a sufficiently generalized set of AIC tasks.
The three scenarios are briefly described in the paragraphs which follow in
order to familiarize the reader with the functional environment supported by
the recognition capabilities described later.

BASIC  INTERCEPTS.  Scenario  1 addressed  the  basic  tasks  which  the  AIC  must 
perform  in  conducting  the  simplest  intercept.  As  the  exercise  unfolds,  the 
AIC  must  locate  his  assigned  aircraft  and  establish  radio  contact  with  the 
pilot.  When  a bogey  (hostile  aircraft)  is  detected,  the  AIC  communicates 
with  the  pilot  to  vector  him  to  a nearest  collision  intercept.  As  the 
exercise  proceeds,  the  controller  must  regularly  and  accurately  provide 
information  to  the  pilot  concerning  the  bogey's  bearing,  range,  heading,  and 
speed.  Scenario  1 concludes  when  the  friendly,  controlled  aircraft  comes 
into  radar  contact  with  the  hostile  aircraft. 

REALISTIC INTERCEPTS. Scenario 2 builds upon the basic structure of
Scenario 1, adding several complications that more nearly represent actual
air intercepts. In addition to providing bogey position and velocity updates,
the controller must now also detect and report any sudden changes in the
bogey's heading, and recommend new vectors to accommodate the maneuvering
hostile aircraft. Moreover, the AIC must detect and report the position and
velocity of other aircraft in the vicinity of the controlled aircraft.
Finally, the AIC must respond to communications from the pilot at the point
of radar contact, at the time when the pilot takes over the intercept
himself ("Judy"), and when the pilot loses contact with the bogey and needs
additional position, velocity, and vectoring information.

THE  TRAINING  ENVIRONMENT.  In  addition  to  learning  to  control  aircraft  in 
combat-like  intercept  conditions,  the  operational  controller  is  often  called 
upon  to  assist  in  pilot  training  by  setting  up  mock  intercepts  in  well 
established  training  areas.  Scenario  3 addresses  this  training  environment 
and  also  provides  a more  challenging  vocabulary  requirement  for  the  AIC 
training  system.  This  is  particularly  true  because  the  transmissions  are 
largely  commands  rather  than  advisories.  Accurate  recognition  is  therefore 
essential  to  the  realistic  presentation  of  the  scenario.  Moreover,  there 
are  many  more  phrases  defined  for  this  scenario.  Scenario  3 commences  with 
two  aircraft  flying  in  formation  toward  the  training  area.  The  AIC  makes 
radar  contact  with  the  two  aircraft  and  establishes  a lost  communications 
protocol  with  the  pilots.  The  AIC  vectors  the  aircraft  to  the  area  and  then 
maintains  the  aircraft  in  the  area  by  providing  heading  changes.  The  AIC 
then  detaches  one  aircraft,  who  will  play  the  bogey,  and  turns  the  other 
aircraft for separation. The controller determines a planning bearing,
target aspect angle, and track crossing angle based upon the point at which
he desires the intercept to take place. After getting the proper
separation, the AIC turns the aircraft for the mock intercept and Scenario 3
continues as described in Scenario 2. When the aircraft merge, the AIC
provides breakaway headings and Scenario 3 concludes as the two aircraft
separate.

SPEECH  CONVENTIONS 

Based upon the three scenarios described in the preceding subsection,
the relevant AIC transmissions were identified. This vocabulary, in turn,
was studied in combination with the limitations imposed by the technologies
and the study objectives of this project. An important area for
investigation was the extent to which vocabulary could be dictated to the
AIC students. Complete flexibility is neither feasible (from a training
system design point of view) nor desirable (from the operational point of
view). A training system that demands reasonable standardization is
commendable but any unnatural speech constraints must be traded off against
user acceptance.

In this regard, certain conventions were adopted for the purposes of
this study. Feedback from the AIC training community regarding the
acceptability of these conventions is addressed in Section VI. The
conventions fall into two categories: general rules and stylization
requirements.

GENERAL  RULES.  Three  general  rules  are  reflected  in  the  AIC  vocabulary 
presented  herein. 

a.  The  call  sign  (Snake  or  Viper)  shall  be  used  in  conjunction  with  a 
transmission  if  and  only  if  the  AIC  is  transmitting  information  which 
requires  some  action  on  the  part  of  the  pilot.  Advisories  or  responses 
shall  not  allow  use  of  the  call  sign. 

b.  "Over"  may  be  said  at  any  time,  but  will  always  require  a short 
pause  (about  one  second  in  duration)  preceding  it. 

c. The AIC must never pause preceding a three-digit heading, and must
always pause following a three-digit heading.

STYLIZATION CONSTRAINTS. Because the speech understanding subsystem
utilizes isolated word (or phrase) recognition techniques, the user must
constrain his verbalizations according to the phrase elements predefined by
the system designers. These phrase elements are indicated by semicolons in
the tables which follow, and require a short pause during voicing. These
stylization requirements (and, for that matter, the general rules as well)
could be significantly relaxed in an environment that utilized only the LCSR
technology rather than a mixed IWR/LCSR approach.

THE VOCABULARY LISTS

Table 1 presents complete phrases used in Scenarios 1 and 2. XXX is a
three-digit number between 001 and 360 (bearing or heading), YY is a number
between zero and sixty-three (range), and Z is a number between 1 and 9
(speed in tenths of Mach).

Table  2 presents  complete  phrases  used  in  Scenario  3. 

Table  3 presents  the  recognition  lexicon  derived  from  the  phrases  and 
stylization  constraints  defined  in  Tables  1 and  2. 
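
The placeholder ranges and the semicolon convention described above can be
checked mechanically. The following is a minimal sketch only; it is not part
of the software developed under this contract (which was written in FORTRAN
5), and all function names are hypothetical. Only the numeric ranges and the
meaning of the semicolon are taken from the text.

```python
def is_valid_heading(xxx: str) -> bool:
    """XXX: exactly three digits, 001 through 360 (bearing or heading)."""
    return len(xxx) == 3 and xxx.isascii() and xxx.isdigit() and 1 <= int(xxx) <= 360


def is_valid_range(yy: int) -> bool:
    """YY: range, zero through sixty-three."""
    return 0 <= yy <= 63


def is_valid_speed(z: int) -> bool:
    """Z: speed in tenths of Mach, 1 through 9."""
    return 1 <= z <= 9


def split_elements(phrase: str) -> list[str]:
    """Split a stylized phrase at semicolons, which mark the required pauses."""
    return [part.strip() for part in phrase.split(";")]
```

For example, `split_elements("Snake vector XXX; for bogey")` yields the two
phrase elements separated by the mandatory pause.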


TABLE  1.  AIC  VOCABULARY  - SCENARIOS  1 AND  2 (27  PHRASES) 


Snake radio check
Snake vector XXX; for bogey
Snake port XXX; for bogey
Snake starboard XXX; for bogey
Bogey XXX; YY
Bogey tracking XXX; speed point Z
Stranger XXX; YY
Stranger tracking XXX; speed point Z
Stranger XXX; YY; tracking D
    D: North, South, East, West,
       Northeast, Northwest, Southeast, Southwest
Stranger opening
Bogey jinking left
Bogey jinking right
Roger, that is your bogey tracking XXX
Negative your bogey XXX; YY
Say again
Correction
Disregard this transmission
Over
Out
Roger


TABLE 2. AIC VOCABULARY - SCENARIO 3 (70 PHRASES)

TABLE 3. AIC RECOGNITION LEXICON (127 ELEMENTS)

NO.  ELEMENT                            NO.  ELEMENT  NO.  ELEMENT

  1  SNAKE RADIO CHECK                   44  21        87  SNAKE CONTINUE
  2  SNAKE VECTOR                        45  22        88  SNAKE BREAKAWAY
  3  BOGEY                               46  23        89  SNAKE DETACH PORT
  4  BOGEY TRACKING                      47  24        90  SNAKE DETACH STARBOARD
  5  STRANGER                            48  25        91  SNAKE PORT HARD
  6  STRANGER TRACKING                   49  26        92  SNAKE STARBOARD HARD
  7  STRANGER OPENING                    50  27        93  SNAKE TIGHTEN TURN
  8  BOGEY JINKING LEFT                  51  28        94  SNAKE EASE TURN
  9  BOGEY JINKING RIGHT                 52  29        95  SNAKE ANCHOR PORT
 10  NEGATIVE YOUR BOGEY                 53  30        96  SNAKE ANCHOR STARBOARD
 11  ROGER THAT IS YOUR BOGEY TRACKING   54  31        97  VIPER CONTINUE
 12  SNAKE PORT                          55  32        98  VIPER BREAKAWAY
 13  SNAKE STARBOARD                     56  33        99  VIPER DETACH PORT
 14  TRACKING NORTH                      57  34       100  VIPER DETACH STARBOARD
 15  TRACKING SOUTH                      58  35       101  VIPER PORT
 16  TRACKING EAST                       59  36       102  VIPER STARBOARD
 17  TRACKING WEST                       60  37       103  VIPER VECTOR
 18  TRACKING NORTHEAST                  61  38       104  VIPER PORT HARD
 19  TRACKING NORTHWEST                  62  39       105  VIPER STARBOARD HARD
 20  TRACKING SOUTHEAST                  63  40       106  VIPER TIGHTEN TURN
 21  TRACKING SOUTHWEST                  64  41       107  VIPER EASE TURN
 22  SPEED POINT                         65  42       108  VIPER ANCHOR PORT
 23  0                                   66  43       109  VIPER ANCHOR STARBOARD
 24  1                                   67  44       110  FOR THE AREA
 25  2                                   68  45       111  FOR SEPARATION
 26  3                                   69  46       112  FOR BREAKAWAY
 27  4                                   70  47       113  FOR BOGEY
 28  5                                   71  48       114  AS BOGEY
 29  6                                   72  49       115  CORRECTION
 30  7                                   73  50       116  SAY AGAIN
 31  8                                   74  51       117  ROGER RADAR CONTACT
 32  9                                   75  52       118  SNAKE SAY LOST COMMUNICATIONS INTENTIONS
 33  10                                  76  53
 34  11                                  77  54       119  SNAKE TANGO ONE TANGO TWO HOT
 35  12                                  78  55       120  RECOMMEND RENDEZVOUS POINT SIERRA
 36  13                                  79  56       121  SNAKE MARK YOUR TACAN
 37  14                                  80  57       122  VIPER RADIO CHECK
 38  15                                  81  58       123  ROGER
 39  16                                  82  59       124  OVER
 40  17                                  83  60       125  OUT
 41  18                                  84  61       126  ROGER OUT
 42  19                                  85  62       127  DISREGARD THIS TRANSMISSION
 43  20                                  86  63

SECTION IV

VOICE  DATA  COLLECTION  PROGRAMS 

A requirement of essentially all speech recognition techniques is to
make available to the system information which directly or indirectly
reflects the vocabulary to be recognized and the characteristics of the
speaker's voice. This information is used as a referent against which the
speech signals are compared and classified during the recognition
procedure.

This section discusses the programs which were developed for
establishing the reference data for the IWR/LCSR speech understanding
subsystem. The unique features of the AIC vocabulary necessitated
significant modifications to the voice data collection programs used in
earlier IWR-based laboratory studies. In addition, computer programs were
developed to perform some of the calculations previously done by hand in
support of establishing the LCSR reference data base.

IWR  SUPPORTING  PROGRAMS 

The GCA laboratory system utilized a collection of software, the Voice
Data Collection (VDC) programs, to create the reference patterns for a
user-defined vocabulary. These programs are described in detail in Section
VII and Appendix B of NAVTRAEQUIPCEN 74-C-0048-1, cited as reference (a) in
Section I of this report. Modifications to this software were required to
support the recognition algorithms developed for the AIC vocabulary. These
changes are discussed in the following paragraphs.

VOCABULARY LIST GENERATION. The Vocabulary List Generation (VLG) program
was modified to enable the creation of vocabulary lists consisting of up to
192 items. The software was updated to Revision 6.3 of Data General's
Real-time Disk Operating System, and was re-written in FORTRAN 5.

The  only  functional  change  to  the  VLG  program  was  in  the  area  of 
pseudo-syntax.  A new  syntax  word  was  defined  which  specified: 

a. 3 digits must follow this phrase.

b. 1 digit must follow this phrase.

c. This phrase is complete in itself.

d.  This  is  a partial  phrase. 

e.  This  is  a pseudo-item  of  the  lexicon. 
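
The five meanings of the syntax word listed above can be illustrated as an
enumeration. This is a hedged sketch, not the VLG implementation: the report
does not give the numeric encoding, nor does it say whether the five cases
were mutually exclusive codes or combinable flags. They are treated as
exclusive codes here purely for simplicity, and the names `SyntaxWord` and
`digits_expected` are hypothetical.

```python
from enum import Enum


class SyntaxWord(Enum):
    # Numeric values are illustrative assumptions; the report does not state them.
    THREE_DIGITS_FOLLOW = 1   # a. 3 digits must follow this phrase
    ONE_DIGIT_FOLLOWS = 2     # b. 1 digit must follow this phrase
    COMPLETE_PHRASE = 3       # c. this phrase is complete in itself
    PARTIAL_PHRASE = 4        # d. this is a partial phrase
    PSEUDO_ITEM = 5           # e. this is a pseudo-item of the lexicon


def digits_expected(word: SyntaxWord) -> int:
    """Return how many spoken digits should follow a phrase with this syntax word."""
    return {
        SyntaxWord.THREE_DIGITS_FOLLOW: 3,
        SyntaxWord.ONE_DIGIT_FOLLOWS: 1,
    }.get(word, 0)
```

Under this reading, a phrase such as "snake vector" would presumably carry
the three-digit code and "speed point" the one-digit code, although the
report does not list the assignment for each lexicon element.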

COMMAND FILE GENERATION. The Command File Generation (CFG) program was not
significantly modified. It was, however, updated to the latest revision of
the operating system.

VOICE TRAINING MODE. The actual voice training program required significant
modifications to support the unique features of the AIC vocabulary and was
rewritten in FORTRAN 5 at the Real-Time Disk Operating System Revision 6.3
level. A description of the functional procedure utilized in creating the
reference data will expedite a description of the supporting software.

The IWR phrases were trained in a fashion similar to that used in the
GCA laboratory environment. The Vocabulary List file was used with a
Command File to direct the prompting of AIC phrases in a semi-realistic
order. The prompting and data collection were not integrated into the AIC
training, however. The most significant departure from the earlier technique
was necessitated by phrases such as "bogey tracking 147". In order to
minimize the requirements on unnatural stylizations, no pause was required
between the heading digits. This required modifications to the data
collection program to form an IWR-type reference pattern for the phrases
which were followed by three digits.

A set of ten three-digit numbers was defined which ranged from the
shortest three-digit number to the longest, in the range 001 to 360. These
numbers are shown in Table 4. The numbers were chosen based upon statistics
gathered in the earlier LCSR development work, and assumed the user would
say "niner" rather than "nine". These ten numbers were stored as pseudo
vocabulary items (items 128 through 137) in the Vocabulary List file, with a
syntax word indicating they were not real items of the lexicon. Before
going through the voice training procedures, each user was prompted to say
these ten numbers, three times each. The length of time (actually, the
number of VIP samples) required to voice these numbers was saved, and the
average time calculated. This number was input along with the user's name
when signing on to the voice training program.


TABLE 4. TEN NUMBERS USED DURING VOICE TRAINING

118   142   194   255   030
211   173   017   349   096

Each phrase that was followed by a three-digit heading or bearing was
joined with each of these ten pseudo-items in the Command File. For
example, in training the partial phrases "snake vector" and "for bogey", the
Command File entry would prompt "snake vector 118 for bogey". The user was
instructed to pause only after the three-digit number; in other words,
between "8" and "for" in the example. The raw input data for the first part
of the utterance was then truncated by the average three-digit time
calculated earlier and used to form a feature pattern representing the
partial phrase "snake vector". The raw data from the second part of the
utterance (following the pause) was used intact to form a feature pattern
representing the partial phrase "for bogey". After ten such patterns were
collected for each phrase or partial phrase, a voice reference pattern was
formed for each and saved on the disk file.
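
The timing and truncation steps described above can be sketched as follows.
This is an illustration under stated assumptions, not the original voice
training code: the raw utterance is represented simply as a list of VIP
samples, the average is rounded to a whole number of samples, and both
function names are hypothetical. The formation of the feature pattern itself
is not shown.

```python
from statistics import mean


def average_digit_length(sample_counts: list[int]) -> int:
    """Average length, in VIP samples, of the timed three-digit utterances.

    The report describes ten numbers spoken three times each, so thirty
    counts would normally be supplied.
    """
    return round(mean(sample_counts))


def truncate_digits(first_part: list[int], digit_length: int) -> list[int]:
    """Drop the trailing digit_length samples from the first part of an utterance.

    What remains is taken to represent the partial phrase alone
    (for example "snake vector" out of "snake vector 118").
    """
    if digit_length >= len(first_part):
        raise ValueError("utterance is not longer than the average digit length")
    return first_part[: len(first_part) - digit_length]
```

Because a single average is applied to every heading, a number spoken faster
or slower than average leaves some digit samples in, or removes some phrase
samples from, the truncated data.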

In  addition  to  the  changes  necessitated  by  these  heading  and  bearing 
phrases,  the  voice  training  software  was  modified  to  create  a slightly 
different  Voice  Data  (VD)  file.  Most  notably,  the  reference  patterns  were 
not  arranged  according  to  their  time  length  as  was  the  case  in  the  earlier 
GCA  system.  Instead,  the  patterns  were  stored  by  vocabulary  index  number. 
The  VD  file  was  also  modified  to  represent  up  to  192  vocabulary  items. 

LCSR  SUPPORTING  PROGRAMS 

No  significant  changes  were  made  to  the  programs  previously  developed 
to  create  the  LCSR  reference  data.  (The  sequence  of  programs  that  were 
executed  to  form  the  LCSR  reference  data  file  is  shown  in  Table  5.)  One 
additional  program  was  developed.  In  the  earlier  LCSR  efforts,  certain 
parameters  were  determined  by  manually  fitting  curves  to  the  observed  data. 
This  "curve  fitting"  procedure  was  automated  by  the  new  program,  thus 
decreasing  the  amount  of  hand  analysis  required  in  the  reference  data 
generation process. (See Appendix B.)

TABLE 5. LCSR ROUTINES USED FOR THE GENERATION OF A MIND FILE

 1. EXTRACT - digitizes, compresses, and stores voice inputs
 2. GWIZ    - lists utterances and makes a first cut at marking the words
              within the utterance
 3. MEND    - creates the example spaces from the handcut input data
 4. GZEC    - forms the sets of transition letter sets
 5. RESCUE  - retrieves the desired transition letter sets
 6. LOOPER  - forms the sets of loop letter sets
 7. DCARDZ  - generates card images for machine formatting
 8. MACFOR  - produces the formatted machines which include the sets of
              transition and loop letter sets
 9. REVEXA  - collects counter data statistics from training speech data
10. VERIFY  - checks for human errors in construction of RVCARDS file of
              counter data to be kept
11. RVDIT   - creates separate counter data files for each of the machines
              from the "good" data
12. COVERT  - gathers statistics on the counters and computes the covariance
              matrix
13. INVERT  - inverts the covariance matrix
14. CROAK   - prints the dispersion matrix and computes additional statistics
15. REVEX   - revised research machine exerciser: exercises the machines over
              interim test data
16. SQUISH  - performs curve fitting function, fitting a curve through
              observed cumulative distribution Qx quality function points
17. QDFIT   - fits a parabola to the set of computed log likelihood ratio
              values by a least squares procedure
18. ADDER   - creates violation tables for transition and loop letter set
              violations
19. AVRAJ   - obtains mean word lengths from counter data file
20. CRAP    - generates critical association parameters for GAPSTER
21. GAPSTER - determination of association, gap, and delay values, Ata,
              Ata, At am
22. DEALER  - generates the file MIND.VD

SECTION V

SPEECH  RECOGNITION  PROGRAMS 

Essentially all of the speech recognition logic was developed
especially for this effort. The codes exist as a FORTRAN 5 program designed
to run in the foreground of a 96K-word Eclipse computer. Within the AIC
environment, the recognition computer shares its resources with the Radar
Simulation and Naval Tactical Data System simulation logic running in the
background. A Megatek display subsystem is driven by this computer as well.
An inter-ground communications area was defined, through which recognition
decisions were passed to the pilot model and performance measurement
subsystem via the Inter-Processor Bus (IPB).

The general information flow within the speech recognition logic is
shown in Figure 1. MEX, the word spotting portion of the LCSR logic, runs
in parallel to the IWR algorithm. MINT, the word selection portion of the
LCSR logic, is invoked if the IWR logic determines that the last recognized
phrase has digits associated with it.
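
The conditional hand-off from the IWR logic to the word selection logic can
be sketched as below. This is a simplified illustration, not the FORTRAN 5
implementation: the parallel operation of MEX is not modeled, the word
selection step is represented by an arbitrary callable standing in for MINT,
and the names `Recognition` and `recognize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Recognition:
    phrase: str                   # phrase chosen by the IWR logic
    digits: Optional[str] = None  # digits chosen by word selection, if any


def recognize(
    iwr_phrase: str,
    digits_expected: int,
    word_select: Callable[[int], str],
) -> Recognition:
    """Invoke word selection only when the IWR result has digits associated with it."""
    if digits_expected > 0:
        return Recognition(iwr_phrase, word_select(digits_expected))
    return Recognition(iwr_phrase)
```

A complete phrase such as "snake radio check" therefore never reaches the
word selection step in this sketch.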

The  following  subsections  describe  in  greater  detail  the  unique  aspects 
of  this  speech  recognition  logic  which  distinguish  it  from  the  IWR  software 
used  in  the  GCA  laboratory  system  and  the  stand-alone  LCSR  system. 

THE  VIP  DRIVER 

The  interface  between  the  Threshold  Technology  500  Preprocessor  and  the 
recognition  software  is  a user-defined  device  driver,  VIPDR.  In  the  AIC 
environment,  this  driver  must  satisfy  the  requirements  of  both  the  IWR  logic 
and  the  LCSR  logic.  Consequently,  VIPDR  was  designed  and  written  to  fill 
one  of  two  input  buffers  with  the  raw  VIP  data,  and  simultaneously  create 
LCSR-type  letters  and  counts.  These  were  sent  on  to  the  MEX  algorithm.  The 
particular  features  to  be  used  in  creating  the  LCSR  letters  were  defined  by 
a software  mask  to  facilitate  modifications. 
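
The role of the software mask can be illustrated as follows. This sketch
rests on assumptions that the report does not confirm: each VIP sample is
treated as an integer word of feature bits, a "letter" is taken to be the
sample with the unselected feature bits masked off, and a "count" is taken
to be the number of consecutive samples yielding the same letter. The
function name is hypothetical and the double-buffered raw data path is not
shown.

```python
def letters_and_counts(samples: list[int], mask: int) -> list[tuple[int, int]]:
    """Mask each sample, then run-length encode the masked values."""
    runs: list[list[int]] = []
    for sample in samples:
        letter = sample & mask          # keep only the features selected by the mask
        if runs and runs[-1][0] == letter:
            runs[-1][1] += 1            # same letter as the previous sample
        else:
            runs.append([letter, 1])    # a new letter begins
    return [(letter, count) for letter, count in runs]
```

Changing which features contribute to the letters then requires only a
different mask value, which is the flexibility the text attributes to the
mask.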

IWR  LOGIC 

When  the  LP^  feature  was  set  indicating  the  end  of  a phrase,  the  IWR 
logic  was  initiated.  Because  the  unknown  phrase  could  be  either  a complete 
phrase  (e.g.,  "snake  radio  check")  or  a phrase  with  digits  ("snake  vector 
123"),  two  input  feature  patterns  were  created.  One  pattern  represented  all 
of  the  input  data;  the  other  was  formed  by  truncating  the  raw  data  by  the 
average  length  of  three  digits,  determined  in  the  data  collection  procedure. 
Each  reference  pattern  was  then  compared  with  the  appropriate  input  pattern, 
depending  upon  the  syntax  word. 

The  correlation  algorithm  was  the  same  used  in  previous  studies.  The 
comparison  routine  was  extensively  re-written,  however,  to  realize  the 
potential  for  high  speed  calculations  in  the  AIC  hardware  environment. 
Eclipse instructions, the Eclipse hardware stack, and the High Speed
Correlator on the VIP interface board were utilized in the new comparison
logic.


The  most  significant  change  to  the  IWR  software  was  made  in  relation  to 
the  storage  of  the  reference  patterns  during  recognition.  In  the  earlier 
GCA  studies,  the  reference  patterns  were  dynamically  retrieved  from  the  disc 
in  real-time.  In  the  AIC  environment,  the  reference  patterns  are  stored  in 
extended  memory  and  accessed  by  the  comparison  logic  via  window  mapping. 
Thus,  the  reference  patterns  are  stored  outside  the  32K  address  space 
accessible  by  the  logic.  When  the  patterns  are  needed,  the  desired  memory 
blocks  are  remapped  into  the  normal  address  space  by  enabling  the  hardware 
memory management unit. No true data transfer occurs. It is only the
addresses at which the data are found that are changed. This in turn
enables very high speed data access.

LCSR  LOGIC 

The AIC environment significantly limits the range of reasonable
utterances which are subject to the LCSR procedure. In the laboratory scenarios,
only three-digit numbers from 001 through 360 need be recognized. Also, the
last digit can be determined using the usually highly accurate "digit
extraction" techniques developed for GCA applications. Finally, the numbers
are always preceded by non-digit voicings, such as "snake vector".
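
The constraint on the digit strings can be illustrated with a short sketch. The function names are hypothetical, and the example assumes only what is stated above: headings run from 001 through 360, and the last digit is known from digit extraction.

```python
from typing import List


def is_valid_heading(digits: str) -> bool:
    """True if `digits` is a three-digit heading or bearing from 001 through 360."""
    if len(digits) != 3 or any(c not in "0123456789" for c in digits):
        return False
    return 1 <= int(digits) <= 360


def candidate_headings(last_digit: int) -> List[str]:
    """All valid headings ending in `last_digit` (assumed known from digit extraction)."""
    return [f"{n:03d}" for n in range(1, 361) if n % 10 == last_digit]
```

Knowing the last digit reduces the candidate set from 360 numbers to 36, which is the kind of supportive information the reformulated procedure can exploit.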

This  supportive  information  was  used  in  a reformulation  of  the  machine 
interaction  (MINT)  algorithm.  A detailed  description  of  those  changes  is 
presented in Appendix A. The MINT procedure was modified in accordance with
the changes and integrated into the AIC speech recognition software. No
significant  changes  were  made  to  the  MEX  algorithm  or  software  to  support 
the  AIC  laboratory  requirements. 

VOICE  VALIDATION  PROGRAM 

The speech recognition logic described above was utilized in the AIC
laboratory model. In addition, the software was configured as a stand-alone
program and thus served as the framework for a voice data validation
program. In this configuration, the recognized phrases were simply echoed on
the system terminal. As such, this program replaced the validation mode of
the GCA Voice Data Collection (VDC) program.

SECTION VI

RESULTS,  CONCLUSIONS,  AND  RECOMMENDATIONS 
RESULTS  AND  CONCLUSIONS 

The development of the speech recognition subsystem for the AIC training
environment has been a fruitful experience. Various lessons were
learned and conclusions were drawn that will influence subsequent
implementations. These results and conclusions, based upon utilization of the
laboratory system, are summarized in the following paragraphs.

IMPLEMENTATION  EFFORT.  Despite  the  fact  that  very  few  new  algorithms  were 
designed,  the  effort  needed  to  implement  the  AIC  speech  recognition  programs 
was  considerable.  The  modifications  to  run  in  the  FORTRAN  5,  mapped 
environment  uncovered  various  unknown  features  of  the  operating  system  and 
run-time  libraries. 

MEMORY  REQUIREMENTS.  The  AIC  recognition  logic  shared  system  resources  with 
the  display  programs.  In  order  to  maintain  the  high  speed  processing  needed 
by both recognition and display, it was necessary to add memory
to the computer. MOS memory was expanded by 32K words, and integration
efforts were simplified by having the display and recognition software run
in separate grounds. The additional cost for the memory was more than
compensated for by decreased software development and integration costs.
The  use  of  virtual  overlays  and  window  mapping  enabled  the  recognition 
software  to  effectively  use  the  additional  memory. 

STYLIZATION CONSTRAINTS. The general rules and stylization constraints
imposed upon the AIC vocabulary were not viewed as detrimental to AIC
training. This view was shared by all three AIC instructors who were asked to
evaluate the chosen vocabulary. It is interesting to note, however, that
all three instructors had initial difficulty in putting the constraints into
practice. Unlike the GCA instructors, the AIC instructors tended to add
pauses rather than delete them. Naive users (those not familiar with the
AIC environment) had no difficulty conforming to the chosen vocabulary.

EXPOSURE TO THE SYSTEM. The first exposure that the users of the system had
to speech recognition was with the aid of the stand-alone Threshold
Technology word recognition system (V19A.SV). The AIC instructors configured
the  system  to  recognize  just  the  digits,  and  were  then  encouraged  to  simply 
play  with  the  system  to  see  what  it  could  and  could  not  recognize.  The 
instructors  were  purposely  left  alone  during  this  exposure  period  so  that 
they  would  feel  comfortable  in  speaking  freely  to  the  machine.  This  initial 
period  gave  them  the  confidence  that  their  speech  really  could  be  recognized 
by  the  computer. 

Following this digit recognition, ten AIC phrases were input to the
stand-alone program and the users experimented with this small AIC
vocabulary. The goal here was to impress upon the instructors that strict
conformance to the defined vocabulary was an essential element for good
recognition.

IWR TRAINING PROCEDURE. The AIC IWR vocabulary was trained in three
separate sessions: (1) range calls (1-63); (2) scenarios 1 and 2 phrases; (3)
scenario 3 phrases. The three-digit time data was also collected in session
1 for subsequent use in session 2. Because of complications in achieving
good accuracy on those phrases consisting of three-digit numbers (see
below), the two instructors from the Fleet Combat Training Center, Pacific,
did not proceed into scenario 3 (session 3). (The results and conclusions
discussed herein are therefore based on the experience gained in the simpler
scenarios 1 and 2. No qualitative difference in recognition requirements
exists in scenario 3, but because of the larger vocabulary, it is reasonable
to expect poorer results for scenario 3 than achieved in scenarios 1 and 2.)

Data collected from four speakers on the 10 three-digit numbers is
shown in Figures 2 and 3. Speaker G is very experienced with speech
recognition systems and typically achieves high accuracy. Note the consistency
in the three repetitions of each number. Speakers L and B are inexperienced
relative to computer speech recognition systems; notice the inconsistency in
their data. This data suggests that time may be a valid metric for an
automatic algorithm to determine when voice data collections should begin and/or
terminate.
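
One way such a timing metric might be used is sketched below. This is an assumption-laden illustration, not a procedure from the study: the spread measure (coefficient of variation), the threshold value, and the function names are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def timing_spread(durations: Sequence[float]) -> float:
    """Coefficient of variation of the durations of repeated utterances."""
    avg = mean(durations)
    return pstdev(durations) / avg if avg > 0 else float("inf")


def collection_may_stop(durations_by_number: Dict[str, Sequence[float]],
                        max_spread: float = 0.05) -> bool:
    """True when every number has at least two repetitions with consistent timing.

    `max_spread` is a hypothetical threshold; a real value would have to be
    chosen from collected data.
    """
    return all(
        len(d) >= 2 and timing_spread(d) <= max_spread
        for d in durations_by_number.values()
    )
```

Under this sketch, a speaker whose three repetitions of each number have nearly equal durations would be judged ready, while a speaker with widely varying durations would be asked for more repetitions.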

Unexpected and severe difficulty was encountered with accurately
recognizing phrases which include three digits at the end. Phrases were
typically misrecognized, and the substitution errors were not intuitively
understandable: "bogey 123" was misrecognized as "bogey jinking left".
Closer analysis suggested that too much raw data was being discarded from
the end of the phrase. Several changes were made, culminating in the
formation and comparison of five feature arrays. Still, good accuracy was
never achieved. The conclusion drawn from this result is that the feature
array and reference array are very sensitive to the three-digit heading or
bearing voicing, and the usual IWR algorithm is not suitable for determining
the beginning portion of the utterance.

J. Porter, who developed the LCSR algorithm, has suggested a dynamic
programming algorithm which, he believes, would result in improved accuracy
on the initial portion of the utterance. Alternatively, one can consider
expanding upon the stylization rules to require a pause both before and
after the three-digit numbers.

LCSR TRAINING PROCEDURE. The LCSR vocabulary was trained using the "magic
number sets" developed in the earlier studies. This procedure was thus even
more divorced from the operational AIC environment than the IWR training
procedure. Eighteen number sets were recorded for one speaker (LHN).
Twelve sets were used to create the reference data and six sets were saved
for subsequent use in the LCSR development effort.

Figure 2. Timing Figures for Two Speakers: G and N
(horizontal axis: three-digit combinations ordered by increasing theorized
utterance length)

Figure 3. Timing Figures for Two Speakers: B and L
(horizontal axis: three-digit combinations ordered by increasing theorized
utterance length)

The processing to create just one file for one speaker required two
calendar months and three labor months. The procedure was significantly
complicated by the facts that personnel who had not previously been involved
with the procedure were utilized, and that many of the computer programs had
been dormant for approximately a year, and consequently through several
revisions of the operating system. Nevertheless, the sheer processing
burden required to create the LCSR data base remains a serious obstacle to
utilization of the technique.

RECOGNITION  SPEED.  Recognition  speed  was  considered  an  unknown  because: 

a.  the  vocabulary  was  large 

b.  both  IWR  and  LCSR  processing  needed  to  be  performed 

c. the computer resources were shared with time-demanding radar
simulation and display

Fortunately, the recognition response is essentially instantaneous.
The effective use of extended memory, together with the powerful Eclipse
instruction set and FORTRAN 5, are seen as the primary factors. A slight
and occasional stutter in the radar display is the only external hint that
the computer is experiencing a heavy processing load.

IWR RECOGNITION ACCURACY. Recognition accuracy of complete phrases is very
good. The system has little difficulty with similar phrases such as "bogey
jinking left" and "bogey jinking right". The system has only occasional
difficulty with ranges, e.g., 23, 33, 43, 53, 63.

LCSR  RECOGNITION  ACCURACY.  Because  of  the  unexpectedly  lengthy  and 
resource-consuming  effort  required  to  create  the  reference  data,  only  one 
speaker  (LHN)  had  experience  with  the  LCSR  capability.  The  difficulties 
discussed  above  confounded  an  objective  assessment  of  the  recognition 
accuracy  for  the  three-digit  numbers,  since  the  MINT  algorithm  was  exercised 
only  when  the  IWR  process  suggested  that  it  was  appropriate.  Nevertheless, 
a subjective appraisal would suggest that the LCSR performance was
consistent with the previously reported results for MWG (see NAVTRAEQUIPCEN
77-C-0096-1).

RECOMMENDATIONS 

This project has highlighted the risks associated with development of a
speech recognition capability to support AIC training. Based upon this
effort, the following recommendations are offered for consideration in the
specification and design of the experimental prototype AIC training system:

a. Stylization constraints which require pauses at natural points
should be considered. User acceptance does not appear to be too adversely
impacted by pauses before and after heading or bearing numbers. Pauses
between each digit are not acceptable, however.

b. Special processing should be developed to recognize those phrases
where IWR and LCSR techniques are both required. A dynamic programming
based algorithm may be the solution.

c. The reference data generation problem severely restricts the use
of the existing LCSR techniques in practical applications.
Additional research should be directed toward determining if an algorithm which
utilizes  only  a very  small  training  sample  (say  ten  repetitions)  can  be 
developed.  Alternatively,  some  form  of  speaker  adaptation  to  a previously 
defined  data  base  may  be  appropriate.  In  any  event,  continued  R&D  is 
required  if  the  current  LCSR  technique  is  to  be  useful  in  the  prototype 
training  system. 

d.  Other  connected  speech  algorithms  (such  as  the  earlier  Logicon 
empirical  approach,  or  the  Nippon  Electric  Company  DP-100  system)  should  be 
considered  in  designing  the  AIC  prototype. 

e.  An  initial  exposure  period  should  be  designed  into  the  system  to 
teach  users  how  to  speak  for  optimum  recognition  accuracy. 

f. Speech data collection must be an integral part of the training
process.




APPENDIX  A 
MINT  FOR  AIC 

Review  of  the  MINT  Problem 


The unpublished memo "Reformulation of the Statistical Basis for
the Machine Interaction Problem" contains an accurate mathematical
statement of the problem solved by MINT. It is largely
applicable to MINT for AIC, so is reviewed here. The reader is
also referred to Technical Report NAVTRAEQUIPCEN 77-C-0096-1 for
a discussion of the MINT algorithm. The point of departure is to
think of the set of potential recognitions picked out of an
utterance by MEX as a set of nodes, or abstract points. The time of
occurrence of these potential recognitions induces constraints on
which potentially recognized words precede which others. The
nodes, together with the relation corresponding to the notion "is
a potential immediate predecessor," then form a directed graph.

It  is  convenient  to  append  two  special  nodes  called  "Start"  and 
"End"  to  the  set  of  nodes,  and  to  extend  the  potential  predecessor 
relation to include "is a potential first word" and "is a potential
last word" in obvious ways.
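
A minimal sketch of how such a graph might be built from recognition times is given below. The timing rule used to decide "is a potential immediate predecessor" (a bounded overlap or gap between one recognition's end and the next one's start) and the parameter values are assumptions for illustration; the actual time constraints are not stated in this section.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Recognition:
    name: str     # unique node label (hypothetical)
    machine: str  # machine type, i.e. the vocabulary item it detects
    start: int    # recognition start time, in counts
    end: int      # recognition end time, in counts


def build_graph(recs: List[Recognition], utterance_end: int,
                max_overlap: int = 2, max_gap: int = 5) -> Dict[str, List[str]]:
    """Return adjacency lists over the nodes Start, End and each recognition."""
    edges: Dict[str, List[str]] = {"Start": [], "End": []}
    for r in recs:
        edges[r.name] = []
    for r in recs:
        if r.start <= max_gap:                   # potential first word
            edges["Start"].append(r.name)
        if utterance_end - r.end <= max_gap:     # potential last word
            edges[r.name].append("End")
        for s in recs:
            # s may follow r if it starts later and the delay is plausible;
            # requiring s.start > r.start keeps the graph acyclic.
            if s is not r and s.start > r.start \
                    and -max_overlap <= s.start - r.end <= max_gap:
                edges[r.name].append(s.name)
    return edges
```

With three recognitions, one early and two competing later ones, this yields a graph with two paths from Start to End, one through each of the competing recognitions.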

The primary role of MEX is thus considered to be recognition of
the start and end of the utterance, notification of potential
recognitions during the utterance, and recording start times and
end times of potential recognitions by which pairs related by the
extended relation "is a potential predecessor of" can be
identified. These data create the directed graph discussed in the
previous report. In addition, MEX provides data to be associated
with the nodes and edges of this graph, making it an annotated
directed graph.

For each edge of the graph an associated time delay or
overlap value can be computed from the recognition start or end
times of the nodes at which it is incident. A single numerical
value thus annotates each edge. The annotation of each node
is more complicated, as it consists of three data:

i) The type of machine responsible for the potential
recognition (with dummy values for the Start and
End nodes).

ii) An "intrinsic property" value. The exact nature
of this value is secondary to the fact that certain
conditional probabilities are known about it, but
to be concrete we note in passing that LISTEN
currently observes a T counter statistic, an L counter
statistic and a violation category. The intrinsic
property value can therefore now be construed to be
an element of the Cartesian product of the sets of
all T counter statistic values, L counter statistic
values and violation categories.

iii) A set of associated machine types. As described
elsewhere, when one potential recognition overlaps another
for a (machine-type dependent) sufficient length of
time, one potential recognition is said to be
"associated" with the other. MEX supplies the data whereby
it can be determined, for each potential recognition,
what machine types caused associated potential
recognitions.

The setting of the MINT problem then consists of a directed
graph with annotations, the latter consisting of a numerical
value for each edge and three information elements for each
node. The memo cited above points out the importance of
correctly identifying the setting of the problem (that which is
simply "given") and distinguishing the setting from the
observation. The following mathematization is useful in what
follows, and it makes the important distinction quite clear.
There are given:

1) A set of potential recognitions, $\Pi$.

2) A set of (real) machine types, $M$.

3) A set of nodes $N = \{\mathrm{Start}, \mathrm{End}\} \cup \Pi$.

4) A set of extended machine types, $M \cup \{\mathrm{Start}, \mathrm{End}\}$.

5) A relation $E$ ("potential predecessor" extended to
all of $N$) on $N$. The members of this relation, $E \subset N \times N$,
make the pair $G = (N, E)$ a directed graph. The time
constraints used in defining the potential predecessor
relation $E$, and the utterances we deal with, have
properties which guarantee that $G$ is acyclic and
connected in the sense that there is a path in $G$ from
Start to any other node, and a path from every node
(other than End) to End.

6) A function $m$; $m(n)$ is the machine type responsible
for the potential recognition $n$ when $n \in \Pi$, else $m(n)$
is Start or End.

7) A set of possible intrinsic properties, $I$.

An "observation" is then the collection of annotations other
than the machine-type identifier. (That is the critical point
in the cited memo: it is hopeless to compute the probability
of occurrence of individual potential recognitions in a
particular order, because the structure of the set of possibilities is too
complicated. So we take the collection of potential
recognitions and their temporal relationships as simply given,
then define the hypotheses and observation in terms of these
data, thereby avoiding the problem of computing the
probability of occurrence of the graph itself.) In mathematical
terms:


An observation $\mathcal{O}$ is a set of three functions:

1) $\delta : E \to \mathbb{Z}$, the integer-valued delay, overlap or gap
associated with each edge.

2) $\iota : \Pi \to I$, the intrinsic properties of each potential
recognition (observed by MEX).

3) $\alpha : \Pi \to \mathcal{P}(M)$, the set of machine types causing
potential recognitions associated with each given potential
recognition:

\[ \alpha(\pi_i) = \{\, m(\pi_j) : \pi_j \text{ is associated with } \pi_i \,\}. \]

Recall $\pi_j$ is associated with $\pi_i$ if the overlap in
their periods of detection exceeds a criterion $c$ which is a function
of machine type.




The set of hypotheses, $\mathcal{H}$ (hypothetical utterances which were
spoken), is the set of all paths in $G$ from Start to End. To
each  such  path  there  corresponds  an  unambiguously  determined 
sequence  of  machine  types,  and  hence  of  vocabulary  items.  Note 
that  two  or  more  paths  may  correspond  to  the  same  sequence  of 
hypothetically  spoken  vocabulary  items,  so  it  is  not  strictly 
correct  to  identify  hypotheses  with  individual  utterances,  as 
there  is  really  a many-one  correspondence  between  these  two 
sets.  Note  also  that  no  hypothesis  is  considered  which  doesn't 
have  a corresponding  path  through  the  graph.  The  effect  of 
this  is  that  MEX  must  have  a machine  of  the  proper  type  going 
to  recognition  at  roughly  the  correct  time  in  order  for  MINT 
to  even  consider  the  correct  explanation  of  an  utterance. 
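
The hypothesis set can be illustrated by enumerating every path from Start to End in a small graph. This is a sketch only; the adjacency-list representation and the function names are assumptions, and it relies on the graph being acyclic, as stated earlier.

```python
from typing import Dict, List, Tuple


def all_hypotheses(edges: Dict[str, List[str]], node: str = "Start") -> List[List[str]]:
    """Enumerate every path from `node` to End in an acyclic graph."""
    if node == "End":
        return [["End"]]
    paths = []
    for nxt in edges.get(node, []):
        for tail in all_hypotheses(edges, nxt):
            paths.append([node] + tail)
    return paths


def vocabulary_sequence(path: List[str], machine_of: Dict[str, str]) -> Tuple[str, ...]:
    """Map a path to the sequence of vocabulary items (machine types) it implies."""
    return tuple(machine_of[n] for n in path if n not in ("Start", "End"))
```

If two nodes on different paths were produced by the same machine type, the two paths map to the same vocabulary sequence, which is the many-one correspondence noted above.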

We  assume  that  there  is  a single  hypothesis  which  is  in  fact  true; 
i.e.,  that  there  is  a unique  sequence  of  potential  recognitions 
forming a path from Start to End, each of which is a real
recognition, all others being artifacts. Strict validity of this
assumption  is  questionable,  but  it  can  be  argued  that  the  single 
true  hypothesis  model  can  be  made  to  describe  the  facts  quite 
accurately  by  properly  interpreting  and  computing  the  statistical 
parameters  needed  to  solve  the  problem. 

The problem MINT solves is the selection of the hypothesis which
best explains the utterance, in the sense that it is the most
probable explanation. Bayes' theorem is invoked, indicating that,
for any hypothesis $H \in \mathcal{H}$, with $\mathcal{O}$ denoting the
observation,

\[ \mathrm{Prob}(H \mid \mathcal{O}) = \frac{\mathrm{Prob}(H)\,\mathrm{Prob}(\mathcal{O} \mid H)}{\mathrm{Prob}(\mathcal{O})} \]

(In words, this says that the probability that any sequence of
potential recognitions is in fact the real sequence, in view of
the observation, is the product of the probability that that
sequence of recognitions will arise at all, and the probability
that the observed details will occur when that sequence is
really spoken, divided by the probability that this particular
set of observations will ever occur.)

In view of the fact that the denominator in the expression
above is the same for all hypotheses $H \in \mathcal{H}$, the most probable
hypothesis is the one which maximizes the product of a priori
probability and conditional probability of occurrence of the
observation, i.e., the numerator in the expression above.
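
A small numerical sketch shows why the denominator can be ignored when selecting a hypothesis. The hypothesis labels and probability values are invented for illustration only.

```python
from typing import Dict, Tuple


def most_probable(prior: Dict[str, float],
                  likelihood: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Pick the hypothesis maximizing prior * likelihood.

    Also returns the normalized posterior, to show that dividing every score
    by the same denominator does not change which hypothesis is largest.
    """
    scores = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(scores.values())  # plays the role of the common denominator
    posterior = {h: s / evidence for h, s in scores.items()}
    best = max(scores, key=scores.get)
    return best, posterior
```

Here the hypothesis with the smaller prior can still win if the observation is sufficiently more likely under it.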

A particularly pleasing aspect of this approach to the problem
arises when one imposes a frequency-of-occurrence interpretation
on the probabilities in the expression above. Under this
interpretation each of the probabilities is considered to be a
"fraction" of many, many cases, with the delightful result that picking
the hypothesis which maximizes the expression on the left is
identical, in the long run, with picking the hypothesis which is
most often correct.

Really, of course, one cannot compute the product of probabilities
in the numerator and thus solve the problem rigorously. One can
only estimate those probabilities using models developed from
many observations of vocal behavior. And the models must be
simple enough to admit the necessary calculation and also the
fitting of the model to the voice training data. Examples
of the kind of simplifying assumptions which get imposed are
the following:

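1. The inter-word delays and the association and
intrinsic characteristics of the potential
recognitions are independent, so

\[ \mathrm{Prob}(\mathcal{O} \mid H) = \mathrm{Prob}(\delta \mid H)\,\mathrm{Prob}(\iota \mid H)\,\mathrm{Prob}(\alpha \mid H) \]

2. The inter-word delays are independent of each
other, so

\[ \mathrm{Prob}(\delta \mid H) = \prod_{e \in E} \mathrm{Prob}(\delta(e) \mid H) \]

3. The intrinsic characteristics of each potential
recognition are independent of each other, as are
their associations, so

\[ \mathrm{Prob}(\iota \mid H) = \prod_{\pi \in \Pi} \mathrm{Prob}(\iota(\pi) \mid H) \]

and

\[ \mathrm{Prob}(\alpha \mid H) = \prod_{\pi \in \Pi} \mathrm{Prob}(\alpha(\pi) \mid H) \]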

The assumption of statistical independence is imposed several
more times, causing each factor in the products above to be
further reduced to a product of simpler terms. Finally the needed
product of the a priori probability of H and the conditional
probability of the observation given the hypothesis is
expressed as a product of very many simple terms. Equally
important as the simplicity of this expression is the fact that those
simple factors (each a probability) can be estimated from the
training data, at least in the earlier LCSR environment. Almost
all of this previous development will carry over into the AIC
environment, so it will be reviewed here before the unique
features of the AIC problem are explored.
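
Once the score is a product of many simple factors, it can be evaluated as a sum of logarithms. The sketch below assumes the individual factor values have already been estimated; working in the log domain is a common numerical convenience and is an assumption here, not something stated in the report.

```python
import math
from typing import Iterable


def log_score(log_prior: float,
              edge_probs: Iterable[float],
              intrinsic_probs: Iterable[float],
              assoc_probs: Iterable[float]) -> float:
    """Log of prior * product of all per-edge and per-node factors.

    Each iterable holds the probabilities of the observed delay, intrinsic
    and association data under one hypothesis. A zero factor rules the
    hypothesis out entirely.
    """
    terms = list(edge_probs) + list(intrinsic_probs) + list(assoc_probs)
    if any(p <= 0.0 for p in terms):
        return float("-inf")
    return log_prior + sum(math.log(p) for p in terms)
```

Comparing hypotheses by `log_score` selects the same hypothesis as comparing the products directly, since the logarithm is monotonic.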

A. $\mathrm{Prob}(\delta(e) \mid H)$

Let $e = (n_i, n_j)$ be an edge of the graph $G$. That is, let $n_i$ be a
potential predecessor of the node $n_j$. Then if $n_i$ and $n_j$ are
potential recognitions, $\delta(e)$ is the delay between the recognition
time of $n_i$ and the start of recognition of $n_j$. If $n_i$ is Start,
$\delta(e)$ is the start delay, and if $n_j$ is End, $\delta(e)$ is the end delay.
(If $n_i$ is Start and $n_j$ is End, $\delta(e)$ is the length of the
utterance.)

We distinguish only two cases; either the hypothesis $H$ contains
the edge $e$ or it doesn't; that is, either $n_i$ and $n_j$ are
consecutive real recognitions (or $n_i$ is Start and $n_j$ is real, or
$n_i$ is real and $n_j$ is End) or they aren't. No distinction is
made within the second alternative, so the delay between a real
recognition and an artifactual recognition, for example, is not
distinguished from the delay between two artifactual recognitions.
Beyond that, we do let the distribution of delays depend upon
the types of machines responsible for the nodes it joins. Three
particular cases are considered.

i) When the nodes $n_i$ and $n_j$ are potential recognitions
(neither Start nor End) the distribution of delay
values is assumed to have a truncated symmetric exponential
shape: the probability that $\delta(e) = d$ falls off
exponentially with $|d - \mu|$ when $|d - \mu| \le w$, and is
zero when $|d - \mu| > w$.
The constants $\mu$ and $w$ depend upon the machine types
of $n_i$ and $n_j$, and whether $e$ is an edge in $H$ or not.

ii) When $n_i$ is Start (and $n_j \in \Pi$) the distribution of
delay values is assumed to have a mass concentration
at zero, and an exponential distribution over values
of $d \ge 2$. (The value $d = 1$ is given zero probability
of occurrence because the preprocessor eliminates all
sounds with a duration of 1 count.) Thus the probability is
$p_0$ at $d = 0$, zero at $d = 1$, and exponentially distributed
with parameter $\lambda$ over $d \ge 2$,
and the constants $p_0$ and $\lambda$ depend upon the machine type
of $n_j$ and whether or not $e$ is in $H$.
iii) When $n_j$ is End (and $n_i \in \Pi$) the distribution of delays
is assumed to have the same truncated symmetric
exponential shape as case i) if the edge is in $H$, and
a uniform distribution over an interval if the edge
is not in $H$. The non-negative character of end delay
values requires that a special normalizing factor be
added in the first case. Thus, in the first case, the probability
that $\delta(e) = d$ is the normalized truncated exponential when
$|d - \mu| \le w$, and

\[ \mathrm{Prob}(\delta(e) = d) = 0 \quad \text{if } |d - \mu| > w, \]

where $w$ and $\mu$ are functions of the machine type
responsible for the potential recognition $n_i$, and the normalizing
factor is a function of $w$ and $\mu$.
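
The truncated symmetric exponential shape can be made concrete with a short sketch. The decay constant `rate`, the integer-valued `mu`, and the function name are assumptions introduced for illustration; the exact expression and constants used in MINT are not reproduced here. The optional `lo` bound mimics the non-negativity of end delays, where renormalizing over the remaining support plays the role of the special normalizing factor.

```python
import math
from typing import Dict, Optional


def truncated_exp_pmf(mu: int, w: int, rate: float = 0.5,
                      lo: Optional[int] = None) -> Dict[int, float]:
    """Discrete distribution proportional to exp(-rate * |d - mu|) for |d - mu| <= w.

    Values outside the window (or below `lo`, if given) get zero probability
    and are simply absent from the returned dictionary.
    """
    support = [d for d in range(mu - w, mu + w + 1) if lo is None or d >= lo]
    if not support:
        raise ValueError("empty support")
    weights = {d: math.exp(-rate * abs(d - mu)) for d in support}
    total = sum(weights.values())  # normalizing factor over the retained support
    return {d: v / total for d, v in weights.items()}
```

For example, `truncated_exp_pmf(3, 2)` is symmetric about 3 and zero outside 1 through 5, while `truncated_exp_pmf(1, 3, lo=0)` drops the negative delays and renormalizes.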


B. $\mathrm{Prob}(\iota(\pi) \mid H)$


The intrinsic properties of each potential recognition observed
or measured by MEX are the violation category, T-counter history
and L-counter history. The violation category, V, is a
classification of the type of violations which occurred in the process
of generating the potential recognition, and contains options
such as no violation, violation of a single internal transition
letter-set at critical features, violation of a single loop
letter set, etc. Denote the set of violation categories as $\mathcal{V}$.
The T-counter history observed by MEX is summarized in the
modified-Mahalanobis metric $Q_T$, which is a measure of how peculiar
the observed T-counter history is as compared to the training
data. $Q_T$ is a non-negative real number, and therein lies a
problem for this discussion. The distribution of $Q_T$ values is most
plausibly modeled as a continuous one with a probability density.
For such a distribution, the probability of occurrence of any
specific value is zero, so the application of Bayes' rule stated
above (in terms of probability of occurrence of the observation,
which includes a particular value) breaks down. One
rigorous correction of this state of affairs is to restate Bayes'
rule in terms of probability densities for the continuous parts
of the problem, but that leads to a fragmented, example-specific
characterization. Another is to use Radon-Nikodym derivatives
(generalized densities) to recover the uniform treatment, but
that is a bit too abstract for this environment.


The formulation we shall adopt here avoids the problem of
continuous distributions by assuming that the T-counter metric $Q_T$
takes on values in a countable set $\mathcal{Q}$, rather than on $\mathbb{R}^+$. (Since
we are dealing with values computed in a computer, this assumption
could be argued to be closer to the truth.) The probability of
occurrence of particular values from the set $\mathcal{Q}$ is assumed to
always be non-zero, alleviating the problem with Bayes' theorem.
Conceptually the members of $\mathcal{Q}$ can be thought of as (usually) small
intervals on $\mathbb{R}^+$. We shall not be explicit about the size and
location of the intervals, except that they should obviously form
a partition of $\mathbb{R}^+$.
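
The idea of replacing a continuous metric by membership in one of a countable set of intervals can be sketched as follows. The interval boundaries and the add-one smoothing used to keep every probability non-zero are assumptions for illustration; the report deliberately leaves the size and location of the intervals unspecified.

```python
import bisect
from typing import List, Sequence


def quantize(q: float, edges: Sequence[float]) -> int:
    """Index of the interval [edges[i], edges[i+1]) containing q.

    `edges` must be sorted and start at 0; the last interval is unbounded,
    so the intervals partition the non-negative reals.
    """
    if q < 0:
        raise ValueError("metric must be non-negative")
    return bisect.bisect_right(edges, q) - 1


def bin_probabilities(samples: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Estimate a probability for each interval from training samples.

    Add-one smoothing (an assumption) guarantees every interval has
    non-zero probability, as the formulation requires.
    """
    counts = [1] * len(edges)
    for q in samples:
        counts[quantize(q, edges)] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

With boundaries `[0.0, 0.5, 1.0, 2.0, 4.0]`, a metric value of 0.49 falls in the first interval and any value of 4.0 or more falls in the last.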


A similar, but even worse, problem holds for the L-counter history
summary. The summary in this case is a non-negative value, $Q_L$,
most reasonably described as a non-negative real value. Unlike
the $Q_T$ values, there is a definite concentration of cases at one
particular value of $Q_L$, viz. zero, so $Q_L$ doesn't have a density.
Fortunately the same solution works; we assume the $Q_L$ values are
taken from a countable set $\mathcal{L}$ such that each element has
non-zero probability of occurrence. One element of $\mathcal{L}$ is the single
value zero. Other elements of $\mathcal{L}$ can be identified with intervals
of $\mathbb{R}^+$.

Using the artifice just introduced, the probability that a
particular set of intrinsic characteristics $v \in \mathcal{V}$, $q_T \in \mathcal{Q}$
and $q_L \in \mathcal{L}$ will be observed for the potential recognition $\pi$
is assumed to be

\[ \mathrm{Prob}(\iota(\pi) = (v, q_T, q_L) \mid H) = I(v)\,J(q_T)\,K(q_L) \]

reflecting the assumption that the three characteristics (violation
category, $Q_T$ and $Q_L$) are all independent. The functions $I$, $J$, $K$
above are probability distributions on the sets $\mathcal{V}$, $\mathcal{Q}$ and $\mathcal{L}$, and
are assumed to depend upon whether or not $\pi$ is in $H$. The
functions $J$ and $K$ also depend upon the type of machine responsible
for $\pi$.

C. $\mathrm{Prob}(\alpha(\pi) \mid H)$

Recall that $\alpha(\pi)$ is a collection of machine types (a subset
of $M$). Machine type $m$ is a member of $\alpha(\pi)$ if there is some
other potential recognition, $\pi'$, which overlaps the potential
recognition $\pi$ in time sufficiently to establish $\pi'$ as
associated with $\pi$, and $m$ is the machine type of $\pi'$. A recognition
by a particular type of machine is assumed to have various
probabilities of having associated recognitions by various machine
types. As usual, we assume these possibilities are independent,
and depend only on the machine type $m$, the machine type
responsible for $\pi$ and whether or not $\pi$ is real, i.e., $\pi \in H$.
Thus

\[ \mathrm{Prob}(\alpha(\pi) \mid H) = \prod_{m \in \alpha(\pi)} A(m) \prod_{m \notin \alpha(\pi)} \bigl(1 - A(m)\bigr) \]

where $A$ is a set of probabilities on $M$, depending upon the
machine type responsible for $\pi$ and whether or not $\pi \in H$.

D.  Prob(H) 

This  term  is  the  a priori  probability  that  the  hypothesis  H is 
true. It answers the following question: Given the directed
graph of potential recognitions, what is the probability that
the particular path H from Start to End is the set of real
recognitions, irrespective of the delay data, intrinsic
recognition characteristics and association data? Under the
frequency  interpretation  of  probabilities,  this  number  also 
has  the  following  significance.  Suppose  MEX  is  exercised  an 
arbitrarily  large  number  of  times.  Consider  those  cases  where 
the  timing  and  the  set  of  potential  recognitions  is  represented 
by the given directed graph (without the delay, intrinsic
characteristic and association data annotated). Then Prob(H) is the
fraction  of  those  cases  in  which  the  hypothesis  H is  in  fact  the 
correct  path  through  the  graph. 

Estimation  of  the  a priori  probability  that  H is  the  correct 
hypothesis  is  a difficult  issue.  The  difficulty  arises  because 
of the conditioning requirement that the directed graph of
potential recognitions should be the one which in fact was observed. For a
given  operational  environment  it  should  be  possible  to  estimate 
the  probability  that  a particular  sequence  of  vocabulary  items 
would  be  spoken,  and  by  extension  the  various  hypotheses  could 
be  given  a priori  probabilities  proportional  to  the  probabilities 
of  occurrence  of  their  associated  vocabulary  strings,  normalized 
if  necessary  to  account  for  the  many-one  correspondence  between 
hypotheses  and  utterances.  This  procedure  is,  however,  unsound, 
as  it  can  only  be  justified  by  the  improbable  assumption  that 
different  hypotheses  within  a given  graph  are  associated  with 
vocabulary  strings  which  are  equally  likely  to  give  rise  to  a 
directed graph isomorphic to the given one; i.e., that graph
structure and utterance content are statistically independent.

In the earlier LCSR environment, where utterances of any length
and all digits could be expected, it was natural to treat all
hypotheses as equally likely a priori. An alternative to this
assumption, which turned out to have remarkably little impact
on recognition accuracy, was to assume that the a priori
probability of a given hypothesis H in a graph G being correct is
the product of the probabilities that H correctly labels each
potential recognition either "real" or "artifact"
(recognitions in H are labeled "real", all others "artifact"),
and that the probability that any particular potential
recognition is real or artifactual depends only on its machine type,
and is revealed by its rate of occurrence in real or artifactual
form in the training sample. This leads to

$$\mathrm{Prob}(H) = \prod_{\pi \in H} B_R(m(\pi)) \prod_{\pi \notin H} B_A(m(\pi))$$

where $B_R$ and $B_A$ are probability values on the set M of machine
types. This method of estimating the a priori probability of
H has no rigorous basis and may differ considerably from the
truth. It seems, in particular, to make use of the same data
as are used in computing the probability of occurrence of
association between various types of machines, and may be a redundant
introduction of the same type of effects.

A rigorous approach to estimating the a priori probabilities
of various hypotheses might start with accumulating examples
represented by a given graph, and observing the relative
frequency with which individual hypotheses are correct in this
collection. But this computation is not practical, as the
number of graphs far exceeds the number of input utterances,
which is itself a large number. Even if the computation could
be performed, how to represent, store and retrieve the resulting
data is a difficult problem. In the absence of such an approach,
a simple theory for the rate of occurrence of hypotheses and
graphs could be used. I have not been able to posit such a
theory in a satisfying way; i.e., in a way which leads to a
simple and plausible calculation. I believe any thoroughly
satisfactory solution will entail revision of the concepts of
association and the potential predecessor relation, as well as
the nature of an hypothesis.

In the AIC environment, however, some unquestionable conclusions
can be reached about the relative a priori probabilities of
various hypotheses. In fact, it is precisely in this factor that the
information unique to the AIC environment is most naturally used.

The details are discussed later, in connection with the new information
in the AIC environment.

How  MINT  Solves  the  Problem 

As described in the LISTEN final report, MINT solves the problem
of finding the hypothesis in the graph G with highest posterior
probability by a dynamic programming procedure. Crucial to the
derivation of that procedure is the fact that the problem can be
converted into the task of finding the hypothesis/path in G with
minimum "cost", where the cost is a sum of values associated
with each potential recognition and edge in an hypothesis. These
individual costs are negative logarithms of ratios of
probabilities described above. As the MINT procedure for AIC will
be a modification of the current version, we must review the
definition of these costs and how they are computed. The
description below differs somewhat from that given in the final
report, but the same object is being described in both cases.
(The description given here may make the connection between
LISTEN's data base and voice data collection and processing a
little clearer.)

The  following  definitions  are  useful. 

As described above in connection with inter-word delays and
overlaps, the probability of occurrence of a given value of
delay is assumed to depend upon whether or not it is a delay
between real recognitions, and also on the types of machines
responsible for the recognitions it joins. The same concept
is extended to initial and final delays. If we let the
expression "$e_{ij} \in H$" indicate T or F depending upon whether the
edge $e_{ij}$ joining nodes $n_i$ and $n_j$ is part of the hypothesis H,
we are assuming the existence of a function, $p_d$, of four
variables such that

$$\mathrm{Prob}(\delta(e_{ij}) = d \mid H) = p_d(d,\, m(n_i),\, m(n_j),\, e_{ij} \in H).$$

(The arguments of $p_d$ are an integer, two extended machine types
and either T or F.)

The probability of occurrence of a given quality category for
a potential recognition is assumed to depend only upon whether
or not that potential recognition is real or artifactual. Using
the expression "$\pi \in H$" to denote T or F according as $\pi$ is,
or is not, in H, it is natural to define the function $p_q$ of
two variables such that

$$\mathrm{Prob}(q(\pi) = q \mid H) = p_q(q,\, \pi \in H).$$

The probability that a potential recognition $\pi$ has a particular
T-counter history quality value $t$ is assumed to be a
function of the machine type responsible for $\pi$ and whether or not
$\pi$ is real. We therefore define the function $p_t$ of three variables
such that

$$\mathrm{Prob}(Q_T(\pi) = t \mid H) = p_t(t,\, m(\pi),\, \pi \in H).$$

Treating L-counter history values in similar fashion we define $p_l$
by

$$\mathrm{Prob}(Q_L(\pi) = l \mid H) = p_l(l,\, m(\pi),\, \pi \in H).$$

The probability that a potential recognition $\pi$, caused by
machine type m, has an associated potential recognition (one
or more) caused by machine type m', is assumed to depend upon
the machine types m and m', and whether or not $\pi$ is real. In
the notation given previously, it becomes natural to define
the function $p_a$ of three variables such that

$$\mathrm{Prob}(\alpha(\pi) = A \mid H) = \prod_{m' \in A} A_\pi(m') \prod_{m' \notin A} \bigl(1 - A_\pi(m')\bigr)$$

where $A_\pi(m') = p_a(m(\pi),\, m',\, \pi \in H)$.


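The problem of picking the most probable hypothesis can now be
transformed to the problem of finding the least cost path through
G as follows. Using the probabilistic model described earlier,
we have

$$-\ln \mathrm{Prob}(H \mid \Omega) = -\ln \mathrm{Prob}(H) + \ln \mathrm{Prob}(\Omega) - \ln \mathrm{Prob}(\Omega \mid H)$$

where

$$\begin{aligned}
-\ln \mathrm{Prob}(\Omega \mid H) = {} & -\sum_{e_{ij} \in E} \ln \mathrm{Prob}(\delta(e_{ij}) = d_{ij} \mid H) \\
& -\sum_{\pi_j \in \Pi} \ln \mathrm{Prob}(q_j, t_j, l_j \mid H) \\
& -\sum_{\pi_j \in \Pi} \ln \mathrm{Prob}(\alpha(\pi_j) = A_j \mid H),
\end{aligned}$$

the sums run over all edges E and all potential recognitions $\Pi$
of the graph, $d_{ij}$ is the delay noted for edge $e_{ij}$,
$(q_j, t_j, l_j)$ are the intrinsic characteristics observed
for potential recognition $\pi_j$, and $A_j$ is
the set of machine types associated with $\pi_j$.

Using the functions defined above,

$$\begin{aligned}
-\ln \mathrm{Prob}(H \mid \Omega) = {} & -\ln \mathrm{Prob}(H) + \ln \mathrm{Prob}(\Omega) \\
& -\sum_{e_{ij} \in E} \ln p_d(d_{ij}, m(n_i), m(n_j), e_{ij} \in H) \\
& -\sum_{\pi_j \in \Pi} \bigl[ \ln p_q(q_j, \pi_j \in H) + \ln p_t(t_j, m(\pi_j), \pi_j \in H) + \ln p_l(l_j, m(\pi_j), \pi_j \in H) \bigr] \\
& -\sum_{\pi_j \in \Pi} \Bigl[ \sum_{m \in A_j} \ln p_a(m(\pi_j), m, \pi_j \in H) + \sum_{m \notin A_j} \ln \bigl(1 - p_a(m(\pi_j), m, \pi_j \in H)\bigr) \Bigr].
\end{aligned}$$

Perhaps the most important thing about this expression is that
it entails sums over all edges and potential recognition nodes
in the graph, and can be shown to differ by a constant from a
sum just over the edges and nodes of the hypothesis H. The
constant is

$$\begin{aligned}
K = {} & -\sum_{e_{ij} \in E} \ln p_d(d_{ij}, m(n_i), m(n_j), F) \\
& -\sum_{\pi_j \in \Pi} \ln \bigl[ p_q(q_j, F)\, p_t(t_j, m(\pi_j), F)\, p_l(l_j, m(\pi_j), F) \bigr] \\
& -\sum_{\pi_j \in \Pi} \Bigl[ \sum_{m \in A_j} \ln p_a(m(\pi_j), m, F) + \sum_{m \notin A_j} \ln \bigl(1 - p_a(m(\pi_j), m, F)\bigr) \Bigr] \\
& + \ln \mathrm{Prob}(\Omega),
\end{aligned}$$

whence

$$\begin{aligned}
-\ln \mathrm{Prob}(H \mid \Omega) = {} & -\ln \mathrm{Prob}(H) \\
& -\sum_{e_{ij} \in H} \ln \frac{p_d(d_{ij}, m(n_i), m(n_j), T)}{p_d(d_{ij}, m(n_i), m(n_j), F)} \\
& -\sum_{\pi_j \in H} \ln \frac{p_q(q_j, T)\, p_t(t_j, m(\pi_j), T)\, p_l(l_j, m(\pi_j), T)}{p_q(q_j, F)\, p_t(t_j, m(\pi_j), F)\, p_l(l_j, m(\pi_j), F)} \\
& -\sum_{\pi_j \in H} \Bigl[ \sum_{m \in A_j} \ln \frac{p_a(m(\pi_j), m, T)}{p_a(m(\pi_j), m, F)} + \sum_{m \notin A_j} \ln \frac{1 - p_a(m(\pi_j), m, T)}{1 - p_a(m(\pi_j), m, F)} \Bigr] \\
& + K.
\end{aligned}$$

Defining the cost to be associated with each edge and node in
the obvious way,

$$-\ln \mathrm{Prob}(H \mid \Omega) = \sum_{e_{ij} \in H} c_{ij} + \sum_{\pi_j \in H} c_j + \bigl[ K - \ln \mathrm{Prob}(H) \bigr].$$

The hypothesis with maximum posterior probability of being
correct is thus the path of least total cost when all the hypotheses
have equal a priori probability of occurring, as in that case
the term in brackets in the expression above is identical for all
hypotheses.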
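The step from "sum over the whole graph" to "sum over the hypothesis only" can be checked numerically. The sketch below is an illustration under invented numbers, not part of MINT: it treats every labeling of three nodes as a candidate (real hypotheses are paths, which is a stricter requirement), omits edges, and leaves ln Prob(Omega) out of the constant, since it is the same for every candidate anyway.

```python
import math
from itertools import product

# Invented likelihoods of the data observed at three potential recognitions,
# under "real" (T) and "artifact" (F). These are NOT values from the report.
P_TRUE = [0.60, 0.10, 0.45]
P_FALSE = [0.20, 0.30, 0.40]


def likelihood(labels):
    """Prob(observation | labeling): a product over ALL nodes in the graph."""
    result = 1.0
    for is_real, p_t, p_f in zip(labels, P_TRUE, P_FALSE):
        result *= p_t if is_real else p_f
    return result


def cost(labels):
    """Sum of -ln(p_T / p_F) over only the nodes labeled real."""
    return sum(
        -math.log(p_t / p_f)
        for is_real, p_t, p_f in zip(labels, P_TRUE, P_FALSE)
        if is_real
    )


# The constant K gathers the all-artifact terms; it is identical for every labeling.
K = -sum(math.log(p_f) for p_f in P_FALSE)

LABELINGS = list(product([False, True], repeat=len(P_TRUE)))
best_by_likelihood = max(LABELINGS, key=likelihood)
best_by_cost = min(LABELINGS, key=cost)
```

Because the two quantities differ only by K, ranking candidates by total cost gives the same winner as ranking them by likelihood.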
When the alternative treatment of a priori probability described
earlier is imposed, one proceeds as follows. Under the
alternative treatment, the a priori probability of a hypothesis H is
assumed to be

$$\mathrm{Prob}(H) = \prod_{\pi \in H} B_R(m(\pi)) \prod_{\pi \notin H} B_A(m(\pi))$$

as described in the discussion of the difficulties connected
with a priori probabilities. Here $B_R$ and $B_A$ are two probability
distributions on the set of machine types. Defining

$$c_j' = c_j - \ln \frac{B_R(m(\pi_j))}{B_A(m(\pi_j))} \quad \text{and} \quad K' = K - \sum_{\pi_j \in \Pi} \ln B_A(m(\pi_j))$$

one obtains

$$-\ln \mathrm{Prob}(H \mid \Omega) = \sum_{e_{ij} \in H} c_{ij} + \sum_{\pi_j \in H} c_j' + K'.$$

So the hypothesis with maximum posterior probability is again the
one with minimum total cost, after adjusting the cost associated
with each potential recognition in a very simple way.
To illustrate the mechanics of the procedure, and the significance of
the statistical models of the variables, consider the computation of
the cost associated with an edge $e_{ij}$. The models given before
indicate that the probability of occurrence of a particular delay value
depends upon distribution parameters $W_0$ and $\mu_0$ if $e_{ij}$ is
incident at Start, or $W$ and $\mu$ if $e_{ij}$ is incident at two
potential recognitions.

Each of these distribution parameters is assumed to be dependent upon
the machine type responsible for the potential recognitions at which
the edge is incident, and whether or not the edge is real. When $e_{ij}$
is incident at End, the distribution of delay values depends upon
parameters we shall denote $W_e$ and $\mu_e$ if the edge is real, and
$W_e'$ and $\mu_e'$ if the edge is not real. The parameters $W_e$ through
$\mu_e'$ are all dependent upon the machine type of the potential
recognition at which $e_{ij}$ is incident. Using the symmetric truncated
exponential distribution forms described earlier, it can be seen that


C‘j 


if  <Jj. 


^ if  n;  •Start 


where 


+ Max  (o,  14'i^Afal  if  • En<i 
ij  +■  Max  ( 0j  - i)  - Max  (o, 


L . -^n  ACm<Hi\T)0-p.frnfrQ,n) 
P*  A(,m(nj),Fr)(l-pa(mfnJ'),F)) 


0W. 
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Tfj  »A(m(’np,f')  - Afafn/),  F) 

Amtn(i££) 

we<WntTl 
we'-  weY">(M 
fJ€~  fcfatoi'h 

is "tb*  normaliacrHon  -feet or  associated 
with  **<**  fJm 

*i  "^n  (S0 

w<in4tt,iKtyVO 

*«»  W (mflip , mfnp , F) 

Hr* 

/V*  fJ(nfri\tntoi\r). 


These equations reflect the assumption that the distribution of [...]
delay values is uniform, with density $1/W$, over the entire interval
of interest.

The matrix GAP in MINT's data base contains numbers analogous to
[...], from which the cost $c_{ij}$ is computed.

The cost associated with each node is computed in similarly simple
fashion. There are contributions due to the node's observed quality
category, T-counter statistic $Q_T$, L-counter statistic $Q_L$, and set of
associated machine types, A. A term is computed for each of these
contributions, as a function of the observed data and sometimes of
the machine type responsible for the node.

Since the probability of occurrence of quality category is assumed
to be independent of machine type, the term contributed by quality
category is a function of quality category only. A function $c_q$ of
quality category is therefore used, where for quality category $q$,

$$c_q(q) = -\ln \frac{p_q(q, T)}{p_q(q, F)}.$$

The  log  likelihood  ratio  for  T-counter  statistic  is  assumed  to  be  a 
quadratic  function  of  the  T-counter  statistic  value,  up  to  a certain 
limit  beyond  which  it  is  assumed  to  be  a constant.  The  parameters 
are  machine-type  dependent.  The  contribution  due  to  T-counter  values 
is  therefore  computed  as 


t^mflrVr/nnfr'rit  Arijt*" 

if 


if  t 


The  machine-type  dependent  parameters  CB  thru  are  estimated  from 
values  observed  in  training  data  by  a rather  complicated  process. 
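A cost of the shape just described (quadratic up to a limit, constant beyond it) can be sketched as follows. The coefficient names `a`, `b`, `c` and `t_limit` are hypothetical stand-ins for the machine-type dependent parameters, and the sketch assumes the constant beyond the limit equals the quadratic's value at the limit, which the text does not state.

```python
def t_counter_cost(t, a, b, c, t_limit):
    """Quadratic in the T-counter statistic up to t_limit, constant beyond.

    a, b, c, t_limit are hypothetical per-machine-type parameters.
    Assumption: the constant region continues the value reached at t_limit.
    """
    t_eff = min(t, t_limit)
    return a + b * t_eff + c * t_eff * t_eff
```

For instance, with a = 1.0, b = 0.5, c = 0.25 and a limit of 4, a statistic of 2 costs 3.0, while any statistic of 4 or more costs 7.0.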

The log likelihood ratio for the L-counter statistic $Q_L$ can be
computed from the assumed distribution of $Q_L$, which is a mass concentration
at zero, and an exponential distribution over positive values of $Q_L$.

The distribution is dependent upon machine type and whether or not
the potential recognition is real; this leads to the computation

$$c_l(\pi) = -\ln \frac{p_l(Q_L(\pi),\, m(\pi),\, T)}{p_l(Q_L(\pi),\, m(\pi),\, F)}.$$
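A minimal sketch of such a computation follows, assuming the distribution is parameterized by a probability mass at zero and an exponential rate for positive values. Both parameter names are hypothetical; the report does not give the parameterization used in MINT's data base.

```python
import math


def ql_likelihood(q, p_zero, rate):
    """Mixed distribution for the L-counter statistic.

    p_zero: probability mass at q == 0 (hypothetical parameter).
    rate:   rate of the exponential density over q > 0 (hypothetical).
    """
    if q == 0:
        return p_zero
    return (1.0 - p_zero) * rate * math.exp(-rate * q)


def l_counter_cost(q, real_params, artifact_params):
    """Negative log likelihood ratio, real versus artifact.

    At q == 0 both terms are masses, and for q > 0 both are densities,
    so the ratio is well defined in either case.
    """
    return -math.log(ql_likelihood(q, *real_params) / ql_likelihood(q, *artifact_params))
```

With real parameters (0.8, 1.0) and artifact parameters (0.4, 0.5), a zero statistic favors "real" (negative cost), while a statistic of 2 favors "artifact" (positive cost).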


Finally, the log likelihood ratio of associations among recognitions
of various machine types is stored as a matrix of values

$$g(m, m') = -\ln \frac{p_a(m, m', T)\,\bigl(1 - p_a(m, m', F)\bigr)}{p_a(m, m', F)\,\bigl(1 - p_a(m, m', T)\bigr)}$$

and a vector of values

$$h(m) = -\sum_{m' \in M} \ln \frac{1 - p_a(m, m', T)}{1 - p_a(m, m', F)}$$

which facilitate the computation of cost due to association, as it
can be found by a number of additions equal to the number of
associated machine types, as

$$c_a(\pi) = h(m(\pi)) + \sum_{m' \in A} g(m(\pi), m').$$
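The saving from storing a matrix and a vector can be illustrated as below. The sketch assumes the association cost is the negative log likelihood ratio over all machine types (associated types contribute a ratio of probabilities, non-associated types a ratio of complements); the probabilities and the names `MATRIX` and `VECTOR` are invented for the illustration.

```python
import math
from itertools import combinations

MACHINE_TYPES = ["m0", "m1", "m2"]  # hypothetical machine types


def p_a(m, m2, real):
    """Invented association probabilities; stands in for trained values."""
    i, j = MACHINE_TYPES.index(m), MACHINE_TYPES.index(m2)
    return 0.10 + 0.05 * (i + j) if real else 0.30 + 0.10 * abs(i - j)


def direct_cost(m, associated):
    """-ln likelihood ratio, visiting every machine type in M."""
    total = 0.0
    for m2 in MACHINE_TYPES:
        if m2 in associated:
            total -= math.log(p_a(m, m2, True) / p_a(m, m2, False))
        else:
            total -= math.log((1 - p_a(m, m2, True)) / (1 - p_a(m, m2, False)))
    return total


# Precomputed tables: one matrix entry per (m, m2), one vector entry per m.
MATRIX = {
    (m, m2): -math.log(
        p_a(m, m2, True) * (1 - p_a(m, m2, False))
        / (p_a(m, m2, False) * (1 - p_a(m, m2, True)))
    )
    for m in MACHINE_TYPES
    for m2 in MACHINE_TYPES
}
VECTOR = {
    m: -sum(
        math.log((1 - p_a(m, m2, True)) / (1 - p_a(m, m2, False)))
        for m2 in MACHINE_TYPES
    )
    for m in MACHINE_TYPES
}


def table_cost(m, associated):
    """One addition per associated machine type, plus the vector entry."""
    return VECTOR[m] + sum(MATRIX[(m, m2)] for m2 in associated)


def all_subsets():
    """Every possible set of associated machine types."""
    for size in range(len(MACHINE_TYPES) + 1):
        for subset in combinations(MACHINE_TYPES, size):
            yield set(subset)
```

Both routes give the same cost for every set of associated types; the table route touches only the associated types at recognition time.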


MINT finds the hypothesis of minimum cost by keeping a pointer
for each node indicating the immediate predecessor in the lowest
cost path through that node. The additive property of the cost
allows one to find this "optimal predecessor" by considering only
the total cost of the optimal paths leading to each potential
predecessor, and the cost of the edge in common with the potential
predecessor. To implement the process it is thus necessary to keep
both the pointer to the optimal predecessor, and the total cost of
the optimal path from there to Start. Let $C_i^*$ be the cost of the
optimal path from Start to the node $n_i$. Then the recursive property
of costs which is exploited in this dynamic programming algorithm is

$$C_i^* = c_i + \min_{n_j \text{ a potential predecessor of } n_i} \bigl( c_{ji} + C_j^* \bigr)$$

and of course the optimal predecessor of $n_i$ is the node $n_j$ which
minimizes the cost $(c_{ji} + C_j^*)$. When End is finally reached, the optimal
path through the whole graph can be recovered by following the optimal
predecessor links upward.
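The recursion above can be sketched as follows. This is an illustrative reading, not MINT's code: the function name, the dictionary-based graph representation and the toy costs are all assumptions, and the sketch presumes End is reachable from Start.

```python
def min_cost_path(order, predecessors, node_cost, edge_cost):
    """Return (total_cost, path) for the cheapest Start-to-End path.

    order: node names in topological order, "Start" first and "End" last.
    predecessors: dict mapping a node to its potential predecessors.
    node_cost / edge_cost: dicts of costs; Start and End carry no node cost.
    """
    best = {"Start": 0.0}  # cost of the optimal path from Start to each node
    pointer = {}           # optimal predecessor of each node
    for node in order[1:]:
        options = [
            (best[p] + edge_cost[(p, node)], p)
            for p in predecessors[node]
            if p in best
        ]
        if not options:
            continue  # node cannot be reached from Start
        cheapest, pred = min(options)
        best[node] = node_cost.get(node, 0.0) + cheapest
        pointer[node] = pred
    # Follow the optimal predecessor links back from End.
    path = ["End"]
    while path[-1] != "Start":
        path.append(pointer[path[-1]])
    return best["End"], path[::-1]


# Toy graph with invented costs.
ORDER = ["Start", "a", "b", "c", "End"]
PREDS = {"a": ["Start"], "b": ["Start"], "c": ["a", "b"], "End": ["c"]}
NODE_COST = {"a": 1.0, "b": 0.5, "c": 0.2}
EDGE_COST = {
    ("Start", "a"): 0.1, ("Start", "b"): 0.3,
    ("a", "c"): 0.2, ("b", "c"): 0.9, ("c", "End"): 0.0,
}
total, path = min_cost_path(ORDER, PREDS, NODE_COST, EDGE_COST)
```

Since the costs are log likelihood ratios they may be negative; processing nodes in topological order keeps the recursion valid in that case.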


New Information in the AIC Environment

The utterances to be processed by LISTEN in the AIC environment differ
from  those  encountered  in  the  LCSR  project  in  that  they  consist  of  a 
combination  of  non-numerical  and  numerical  data.  The  special  features 
of  this  environment  allow  one  to  state  a priori,  with  high  probability, 
that  these  utterances  consist  of  an  initial  segment  of  non-numerical 
data  and  a final  segment  consisting  of  3 digits,  representing  a number 
between  000  and  359,  inclusive. 

In  addition,  the  Isolated  Word  Recognition  capability  will  be  used  to 
estimate, with something like 90% accuracy, the last digit spoken.

MEX  will  be  operating  throughout  the  utterance,  and  no  doubt  many 
artifactual  digit  recognitions  will  occur  during  the  non-digital 
portion  of  the  utterance. 
The kinds of knowledge available for choosing the best explanation of
an utterance in the AIC environment depend upon the voice data
collection procedures to be followed, as well as the vocal behavior imposed
by the AIC task. In particular, it is assumed that voice data will be
collected and processed for AIC trainees much as it was for LCSR
subjects. The training utterances (for connected word recognition) will
consist of connected strings of digits, and not, for instance, the
non-digital portions of the AIC vocabulary. This precludes measuring
the frequency with which various digits are falsely recognized in the
non-digital portion of these utterances, a datum which could, in
principle, be used in the recognition procedures.

In terms of the directed graph of an utterance, the problem of picking
the best explanation for the utterance can again be interpreted as
choosing the path through the graph which maximizes the product of a
priori probability of occurrence and conditional probability of
producing the observed data. The a priori data, derived from the nature
of the AIC environment, include the facts that there are probably at
least three digits in the utterance, constituting a number between 000
and 359, that any additional recognitions are probably false, that the
three real digits are the last in the utterance, and that the last digit
is probably the one specified by the IWR subsystem.

As only digits will be spoken during the training period, it will
be assumed that nothing is known about artifactual recognitions
which may occur during the non-digit portions of an utterance.

Although some data of this sort could be collected during operation,
this adaptive type of approach will not be adopted now. We shall
instead base the selection of the best explanation of an utterance
exclusively on the characteristics of the last three potential
recognitions in each hypothesis, ignoring entirely the characteristics
of the remainder of the hypothesis, except for its duration, as
explained below.

The start time of a potential recognition is particularly important
when that potential recognition is being considered as a candidate
for the first digit spoken, as in that case the start time is equal
to the length of time it took to speak the initial non-digital
portion of the utterance. In the AIC environment there are several
possible non-digital initial utterance segments, such as "Come left
to heading...". These utterances will be recognized and
discriminated by the IWR portion of SOS. Furthermore, the mean time to speak
the non-digital portions of these utterances will also be known to
the IWR subsystem, and can be used as an additional factor in
discriminating among hypotheses, by comparing it to the start time of
a hypothetical first spoken digit.

The basic dynamic programming scheme used in MINT to find the best
explanation of an utterance does not apply in the AIC environment
(at least without modification) because individual nodes do not
have unique optimal predecessors. To illustrate this fact, consider
a potential recognition whose corresponding vocabulary item is "7".
This potential recognition, like others, may lie in several
hypotheses. Suppose it has potential predecessors corresponding to
vocabulary items "2" and "3", and that the "3" is intrinsically a
more attractive potential recognition, on the basis of nominal
violations and very nominal T and L counter values, for instance.

For a hypothesis in which the "7" is the next-to-last digit spoken,
the "3" is an unlikely immediate predecessor because the associated
interpretation is a number greater than 359. However, for a
hypothesis in which the "7" is the last digit spoken, the "3" is an
acceptable immediate predecessor if it has an immediate predecessor
of suitable type (0, 1, 2, or 3). The optimal predecessor of a node
is thus dependent upon the position of the node in an hypothesis.
Treating it as such solves the problem, as will be shown.
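The "7" example can be made concrete with a small position-dependent search. The sketch below is one possible reading, not the new MINT itself: it keeps, for each node, a best cost as first, second and third digit, with a pointer for the second and third positions. The function name, the graph encoding, the `range_penalty` parameter (standing in for the cost of a leading pair above 35) and all numbers are assumptions; start-time and IWR costs are left out, and at least one three-node path is presumed to exist.

```python
import math


def best_three_digit(order, predecessors, digit, node_cost, edge_cost,
                     final_nodes, range_penalty):
    """Position-dependent dynamic programming over three-digit hypotheses.

    order: potential recognitions in topological (time) order.
    digit: dict mapping each node to the digit (0..9) it represents.
    final_nodes: nodes allowed to be the last digit spoken.
    range_penalty: extra cost when the first two digits exceed 35.
    Returns (cost, (first, second, third), back2, back3).
    """
    cost1 = dict(node_cost)  # best cost with the node as first digit
    cost2, back2 = {}, {}    # best cost and pointer as second digit
    cost3, back3 = {}, {}    # best cost and pointer as third digit
    for n in order:
        best2, best3 = (math.inf, None), (math.inf, None)
        for p in predecessors.get(n, []):
            pair = 10 * digit[p] + digit[n]
            penalty = 0.0 if pair <= 35 else range_penalty
            as_second = cost1[p] + edge_cost[(p, n)] + penalty
            as_third = cost2[p] + edge_cost[(p, n)]
            if as_second < best2[0]:
                best2 = (as_second, p)
            if as_third < best3[0]:
                best3 = (as_third, p)
        cost2[n], back2[n] = best2[0] + node_cost[n], best2[1]
        cost3[n], back3[n] = best3[0] + node_cost[n], best3[1]
    third = min(final_nodes, key=lambda n: cost3[n])
    second = back3[third]
    first = back2[second]
    return cost3[third], (first, second, third), back2, back3


# Toy graph: a "1", then a "2" and a more attractive "3", then "7", then "5".
ORDER = ["x1", "n2", "n3", "n7", "n5"]
DIGIT = {"x1": 1, "n2": 2, "n3": 3, "n7": 7, "n5": 5}
PREDS = {"n2": ["x1"], "n3": ["x1"], "n7": ["n2", "n3"], "n5": ["n7"]}
NODE_COST = {"x1": 0.0, "n2": 1.0, "n3": 0.2, "n7": 0.1, "n5": 0.1}
EDGE_COST = {
    ("x1", "n2"): 0.0, ("x1", "n3"): 0.0,
    ("n2", "n7"): 0.0, ("n3", "n7"): 0.0, ("n7", "n5"): 0.0,
}
cost, digits, back2, back3 = best_three_digit(
    ORDER, PREDS, DIGIT, NODE_COST, EDGE_COST, ["n5"], 5.0)
```

In this toy graph the "7" keeps the "2" as its pointer when it is the second digit (37 exceeds 35) and the "3" when it is the third digit, which is exactly the position dependence described.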
Formal Basis for New MINT

To describe the new AIC version of the MINT algorithm, we extend the
formal definition of the problem to include the directed graph
and the elements of the observation (the functions $\delta$, $\chi$ and $\alpha$)
described before, and add a function $\beta$ which gives the start time
of each potential recognition (measured in "counts" from the
beginning of the utterance). This is a function from the set $\Pi$ of
potential recognitions into the set of natural numbers:

$$\beta : \Pi \to N$$

The observation, $\Omega$, is now a 4-tuple $(\delta, \chi, \alpha, \beta)$, rather
than the 3-tuple $(\delta, \chi, \alpha)$.

As we expect only three digits to be spoken, and those at the end
of the utterance, it is appropriate to modify the definition of
an hypothesis to consist of a path in the directed graph
descriptive of the utterance, not from Start to End as before, but from a
potential recognition, through two others to End. As there is some
finite probability that a trainee may incorrectly speak other than
three digits, it might seem appropriate to also consider hypotheses
with other than three digits, but with lower a priori probability.
There is definite merit to this viewpoint, but in the interest of
simplicity we shall restrict attention to three-digit hypotheses.
This should seldom lead to erroneous speech understanding in the
AIC environment, although it will definitely lead to erroneous
speech recognition when other than three digits are spoken. The
reason for this is that the output of the speech recognition
procedures (a three digit string) will be compared with what was
supposed to have been spoken, and it is very unlikely that the new
MINT will produce the "correct" response when it was not in fact
given by the trainee. Perhaps the most likely case leading to such
erroneous "correct" response would arise when the trainee speaks
four or more digits consisting of the correct three digits,
preceded by some extraneous digits. By considering four (or more)
digit hypotheses the new MINT would have some potential to detect
this kind of erroneous response. However, the data structure and
the processing time grow with the maximum number of digits to be
considered, and the added potential is not worth the added
implementation cost.

Implementation costs are also the basis for restricting attention
to responses on the interval 000-359. In reality, magnetic
headings are given as numbers on the interval 001-360; that is,
magnetic north is indicated by the string 360 rather than 000. MINT
is intrinsically concerned with contiguous pairs of potential
recognitions, as it uses the potential predecessor relation, and the
cost associated with an edge (and an edge is really a potential
contiguous pair of recognitions), so it is natural and easy to deal
with the interval 000-359, as it can be associated with the set of
all hypotheses whose first contiguous pair correspond to numbers
in the interval 00-35. Special logic and perhaps special data
structures have to be introduced to treat correctly the real
interval, 001-360, and as only one value of the 360 possible is concerned,
it is not worth the added implementation burden to accommodate the
real interval.


Hypotheses will thus consist of three potential recognitions and
the edges connecting them. The changes of definition of the
observation and a hypothesis in no way invalidate the statistical
decision theory approach or its solution through the application of
Bayes' theorem. We still select the hypothesis with maximum
posterior probability,

$$\mathrm{Prob}(H \mid \Omega) = \frac{\mathrm{Prob}(H)\, \mathrm{Prob}(\Omega \mid H)}{\mathrm{Prob}(\Omega)},$$

and as the probability of the observation is common to all
hypotheses, we are again led to detect the hypothesis with maximum
product of a priori probability and probability of producing the
observation.


The a priori probability of each hypothesis will be modeled as the
product of three probabilities: one depending upon whether the
hypothesis corresponds to a number in the interval 000-359, one
depending upon whether the last potential recognition in the
hypothesis agrees with the IWR prediction of the last vocabulary item,
and one depending upon the a priori probability used in the older
LISTEN system. (As discussed earlier, there are two versions of
a priori probability in the current system.) Thus the new a priori
probability of a hypothesis containing potential recognitions
$\pi_i$, $\pi_j$, $\pi_k$ in that order is

$$\mathrm{Prob}(H) = P_{359}(m(\pi_i), m(\pi_j)) \cdot P(m(\pi_k)) \cdot P^*(H)$$

where

$$P_{359}(m, m') = \begin{cases} \rho & \text{if } m\, m' \text{ corresponds to a number in 000-359} \\ 1 - \rho & \text{otherwise,} \end{cases}$$

$P(m)$ is the a priori probability that the last digit spoken was of
type m, received from the IWR subsystem, and $P^*(H)$ is the currently
computed a priori probability of H.

Notice that $\rho$ is simply the a priori probability that the trainee
will in fact respond with a number in the interval 000-359. Also
notice that in this formulation a vector of probabilities, $P(m)$,
is assumed to be provided by the IWR subsystem, indicating its
estimated probability that the last digit spoken is a particular
vocabulary item. This may be zero (or at least vanishingly small)
for all but one machine type, or any other distribution over the
set of machine types.
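As a small illustration, the two AIC-specific factors of this a priori probability can be turned into a cost as follows. The function and parameter names are hypothetical, the factor carried over from the older system is omitted, and the sketch assumes the out-of-range case is assigned the complement of the in-range probability.

```python
import math


def prior_cost(d1, d2, d3, rho, iwr_probs):
    """Negative log of (range factor) * (IWR factor) for digits d1 d2 d3.

    rho: assumed a priori probability that the response lies in 000-359.
    iwr_probs: dict of IWR-supplied probabilities for the last digit.
    Assumption: an out-of-range leading pair gets probability 1 - rho.
    """
    in_range = 10 * d1 + d2 <= 35       # first pair must lie in 00-35
    p_range = rho if in_range else 1.0 - rho
    p_last = iwr_probs.get(d3, 0.0)
    if p_last <= 0.0:
        return math.inf                 # IWR rules this last digit out
    return -math.log(p_range) - math.log(p_last)
```

With rho = 0.95 and an IWR vector giving 0.9 to the digit 5, the string 275 is cheaper than 375, and any string ending in a digit with zero IWR probability has infinite cost.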


As before, we model the conditional probability of the observation
$\Omega = (\delta, \chi, \alpha, \beta)$ as the product of the conditional
probabilities of its components, thus invoking the assumption of
independence among these components:

$$\mathrm{Prob}(\Omega \mid H) = \mathrm{Prob}(\delta \mid H)\, \mathrm{Prob}(\chi \mid H)\, \mathrm{Prob}(\alpha \mid H)\, \mathrm{Prob}(\beta \mid H)$$

The first three factors, dealing with the delays between nodes and
the properties of each potential recognition, will be modeled
exactly as before. The last term, dealing with the start time of
potential recognitions, will be modeled as follows. The mean start
time of recognition of the first spoken digit will be assumed to
be the same as the mean time required to speak the non-digital
portion of the utterance, and will be supplied by the IWR subsystem.

Let this value be $\mu_\beta$. Then the distribution of start times for
recognition of the real first digit spoken will be assumed to be
(discretized) normal with mean $\mu_\beta$ and standard deviation equal to
a fraction f (about 20%) of $\mu_\beta$. The distribution of start times
for all recognitions (real or artifactual) which are not the real
first digit spoken will be assumed to have the constant density
$1/W_\beta$ throughout the region of interest. Finally, the probability
of observing all of the start times for all the potential recognitions
will be assumed to be the product of the probabilities of observing
each individually. That is

$$\mathrm{Prob}(\beta \mid H) = \prod_{\pi \in \Pi} p(\beta(\pi) \mid H)$$

where

$$p(\beta(\pi) = x \mid H) = \begin{cases} N(x;\, \mu_\beta,\, f \mu_\beta) & \text{if } \pi \text{ is the first recognition in } H \\ 1 / W_\beta & \text{otherwise,} \end{cases}$$

and $N(x;\, \mu,\, \sigma)$ denotes the (discretized) normal density with
mean $\mu$ and standard deviation $\sigma$.

Proceeding exactly as before, by introducing negative natural
logarithms and defining costs $c_{ij}$ associated with edge $e_{ij}$ and
$c_j$ associated with potential recognition $\pi_j$ (each cost being
the negative natural logarithm of a likelihood ratio), one obtains
for hypothesis H consisting of recognitions $\pi_i$, $\pi_j$, $\pi_k$,

$$\begin{aligned}
-\ln \mathrm{Prob}(H \mid \Omega) = {} & -\ln P_{359}(m(\pi_i), m(\pi_j)) - \ln P(m(\pi_k)) + c_i^\beta \\
& + \sum_{e_{rs} \in H} c_{rs} + \sum_{\pi_r \in H} c_r + \bigl[ K - \ln P^*(H) \bigr]
\end{aligned}$$

where all terms except $c_i^\beta$ are as previously defined. The term
$c_i^\beta$ is the cost (negative log likelihood ratio) corresponding to
the start time observed for the first recognition in H, $\pi_i$:

$$c_i^\beta = A \bigl( \beta(\pi_i) - \mu_\beta \bigr)^2 + C, \qquad A = \frac{1}{2 f^2 \mu_\beta^2}, \qquad C = \ln \frac{f \mu_\beta \sqrt{2 \cdot 3.14159\ldots}}{W_\beta}.$$

The term in brackets in the expression for $-\ln \mathrm{Prob}(H \mid \Omega)$ is again
either constant for all hypotheses, or by suitable modification
of the cost associated with every node, it can be made to be so.
Hence, it can be neglected in our search for the best hypothesis.


The solution is still the hypothesis which minimizes the cost,
but the cost now has three new terms. It is natural to interpret
these three new terms as costs associated with parts of the
hypothesis. We will use the letter a to denote these costs since they
arise in the AIC environment. (The displayed expressions for the
three terms are illegible in the source scan.)

The first term is naturally interpreted as a cost associated with the
edge joining the first and second potential recognitions. Its value
takes one form (a negative natural logarithm) if the two recognitions
correspond to a number in 00-35, and another form otherwise.

The second term is naturally interpreted as a cost associated with the
third potential recognition of an hypothesis. Its value is the negative
natural logarithm of the IWR-supplied probability that the last
digit spoken was of machine type m.

The third term is more or less naturally interpreted as a cost
associated with the first potential recognition of an hypothesis.

The problem reduces, then, to finding the hypothesis with minimum
cost $C_H$, where

\[ C_H = \sum_{\pi_k \in H} C_k + \sum_{e_{ij} \in H} C_{ij} + (\text{the three new costs } a) \]


The critical and difficult change in the problem is that the cost
associated with an edge or a potential recognition now depends upon
its position within an hypothesis. The crucial property of costs
which underlay the dynamic programming solution used in the first
MINT (described in detail on pages 71 and 72 of the LISTEN final
report) no longer holds true. The necessary modification is to
associate not one optimal cost and pointer, but three optimal costs
and two pointers, with each node. The single optimal cost used before
was the total cost of the minimal cost path from Start to the node in
question; that is, the cost of the best partial solution leading to
that node. The three optimal costs in the new formulation are the
total costs of the best partial solution leading to the node in
question if it is regarded as the first, the second or the third
digit spoken. These costs are different, and in general the optimal
predecessors will be different nodes when a node is considered
as the second or the third node of an hypothesis. Thus two pointers
are needed.

The algorithm required is then quite obvious. The best hypothesis
can be identified as the potential predecessor of End which has
the smallest optimal cost of the third kind after adding the cost
associated with the final delay (the edge incident at End and the
node in question). It is found by dynamic programming as before,
using the three costs. If we denote the three optimal costs associated
with each node as $q_1$, $q_2$ and $q_3$, one proceeds by processing
each potential recognition, $\pi_j$, as it is received from
MEX as follows:

1.  The optimal path through this node as the first word
of an hypothesis is just the node itself. Therefore the
total cost of the optimal path of length 1 through
node $\pi_j$ is just the intrinsic cost of the node (as computed
before, reflecting violations, T and L Counters, associations
and possibly an a priori consideration) plus its
unique cost, $a_1(j)$, due to its being the first digit of the
hypothesis:

\[ q_1(\pi_j) = C_j + a_1(j) \]

2.  An optimal path of length two ending at this node consists
of this node, its predecessor and the edge joining
them. The total cost of such a partial solution is the
sum of the normal costs associated with the edge and
this node, together with the optimal cost of the predecessor
as the first word of an hypothesis, and the special
cost, $a_2(i,j)$, associated with an edge joining the first two words
of an hypothesis (to bias in favor of the interval 00-35).
Thus

\[ q_2(\pi_j) = C_j + \min_{\pi_i \text{ a potential predecessor of } \pi_j} \left[ q_1(\pi_i) + C_{ij} + a_2(i,j) \right] \]

If, however, this node has only Start, and no potential
recognition, as potential predecessor, this node is not
a viable candidate as the second member of an hypothesis.
Its cost, $q_2(\pi_j)$, in this case should be very large. This
effect can be obtained by setting $q_1(\mathrm{Start})$ to a very large
number. The minimization above can then be carried out
over all potential predecessor nodes, regardless of
whether they are potential recognitions or Start.


The pointer $p_2$ (which points to the optimal predecessor) is
set to point at a node which attains this minimum.


3.  Similarly, an optimal path of length three ending at node
$\pi_j$ is a path of length two, plus an edge, plus this node.
Its cost is the sum of the normal cost plus the cost, $a_3(j)$,
reflecting the IWR bias towards certain last words. The
result is

\[ q_3(\pi_j) = C_j + a_3(j) + \min_{\pi_i \text{ a potential predecessor of } \pi_j} \left[ q_2(\pi_i) + C_{ij} \right] \]

Again the possibility that the node has no potential
predecessor viable as the second word of an hypothesis
is nicely handled by setting $q_2(\mathrm{Start})$ to a high value.
The pointer, $p_3$, is determined in the search for minimum
$q_3$, just as $p_2$ was.
As stated above, the optimal hypothesis is identifiable as the
potential predecessor of End whose $q_3$ value, when added to the
cost of the edge joining it to End, is minimal. Searching for
this optimal hypothesis can therefore be accomplished in much
the same way as optimal costs and pointers are determined for
any other node. It is more natural, however, to modify MINT to
treat the processing of End as a special case.
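
The three-cost, two-pointer recursion described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MINT implementation: the function name, the argument layout, and the names `a1`, `a2` and `a3` for the position-dependent costs are all introduced here for the example. Start is handled by leaving it out of the predecessor lists and initialising the second and third costs to infinity, which has the same effect as assigning Start a very large cost.

```python
import math

INF = math.inf  # plays the role of the "very large number" assigned to Start


def best_three_digit_hypothesis(nodes, preds, C, Cedge, a1, a2, a3, end_preds, Cend):
    """Return (cost, [first, second, third]) for the cheapest three-node hypothesis.

    nodes     -- node ids in arrival order (predecessors before successors)
    preds     -- preds[j]: potential predecessors of j (Start excluded)
    C         -- C[j]: intrinsic cost of node j
    Cedge     -- Cedge[(i, j)]: normal cost of the edge from i to j
    a1, a3    -- a1[j], a3[j]: extra cost of j as the first / third digit
    a2        -- a2[(i, j)]: extra cost of the edge joining first and second digit
    end_preds -- nodes that may immediately precede End
    Cend      -- Cend[j]: cost of the final-delay edge from j to End
    """
    q1, q2, q3, p2, p3 = {}, {}, {}, {}, {}
    for j in nodes:
        # Step 1: the best path of length one is the node itself.
        q1[j] = C[j] + a1[j]
        # Step 2: best path of length two ending at j, with pointer p2.
        q2[j], p2[j] = INF, None
        for i in preds[j]:
            total = C[j] + q1[i] + Cedge[(i, j)] + a2[(i, j)]
            if total < q2[j]:
                q2[j], p2[j] = total, i
        # Step 3: best path of length three ending at j, with pointer p3.
        q3[j], p3[j] = INF, None
        for i in preds[j]:
            total = C[j] + a3[j] + q2[i] + Cedge[(i, j)]
            if total < q3[j]:
                q3[j], p3[j] = total, i
    # End is treated as a special case: add the final-delay edge cost.
    best, last = INF, None
    for j in end_preds:
        if q3[j] + Cend[j] < best:
            best, last = q3[j] + Cend[j], j
    if last is None:
        return INF, []
    second = p3[last]
    return best, [p2[second], second, last]
```

The hypothesis is read back by following the third-digit pointer from the winning node and then the second-digit pointer from its predecessor, which is why two pointers per node suffice.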
APPENDIX B

CURVE  FITTING  ROUTINE 


The curve fitting routine SQUISH is designed to fit a curve through the
observed points of the cumulative distribution of the $Q_T$ quality function
values for both real and artifact recognitions.

More specifically, the observed points which must be fit are points of the
cumulative distribution

\[ P_r(\delta) = \mathrm{Prob}[\, Q_T(\pi) < \delta \mid \pi \text{ is real} \,] \]

for a real recognition, and

\[ P_a(\delta) = \mathrm{Prob}[\, Q_T(\pi) < \delta \mid \pi \text{ is an artifact} \,] \]

for an artifact.
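
As an illustration of what the "observed points" of such a cumulative distribution look like, the following Python sketch tabulates the fraction of quality values falling strictly below each threshold. The function name and the sample values are assumptions made for this example only; the report does not say how the observed points were tabulated.

```python
def empirical_cdf_points(values, thresholds):
    """Return (threshold, fraction of values strictly below it) pairs.

    values     -- non-empty list of quality function values for one class
                  (real recognitions or artifacts)
    thresholds -- the thresholds at which to tabulate the distribution
    """
    n = len(values)
    return [(d, sum(1 for v in values if v < d) / n) for d in thresholds]


# Illustrative data only.
points = empirical_cdf_points([0.1, 0.4, 0.4, 0.9], [0.0, 0.4, 0.5, 1.0])
```

Running the same tabulation once on the real recognitions and once on the artifacts would give the two point sets that a curve fitting routine is then asked to fit.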


To do this "fitting", three separate problems must be handled:

(1)  An appropriate functional form for the "fitting" function must be
found.

(2)  The standard or norm of approximation (e.g. least squares, minimax)
must be chosen.

(3)  An algorithm to accomplish the fitting must be designed.

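After several false starts and dead ends, a solution to (1) was found. The
fitting function chosen was the three parameter function

[Equation illegible in the source scan: $Q(\delta)$, expressed in terms of
$\alpha$, $\beta$, $\gamma$ and an exponential.]

where $\alpha$, $\beta$, and $\gamma$ are positive real numbers to be determined.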


The solution of (2) was to use a minimax approximation. That is, the fit
must have the property that the maximum deviation between each pair of
observed value $Y_i$ and predicted value $Q(\delta_i)$ is minimal. In other words,
the fit is to be an approximation in the supremum norm.


The algorithm employed to do the fitting (that is, to find the coefficients
$\alpha$, $\beta$, and $\gamma$ for each set of data) relies upon the observation that the sort
of minimax approximation which is sought must have a special property.
Namely, there must exist exactly four points $\delta_1$, $\delta_2$, $\delta_3$ and $\delta_4$ at which the
maximum errors occur, and the maximum errors must have the further property
that they are equal in absolute value and alternating in sign. So if $Y_i$
is the observed value at $\delta_i$ and $Q(\delta_i)$ is the predicted value at $\delta_i$, then the
errors $\epsilon_i = Y_i - Q(\delta_i)$ must satisfy

\[ \epsilon_1 = -\epsilon_2 = \epsilon_3 = -\epsilon_4 \]

The actual fitting procedure begins by finding initial guesses at the $\alpha$, $\beta$,
and $\gamma$ parameters, determining at each stage the four points of worst fit,
and then refining the choices of $\alpha$, $\beta$, and $\gamma$ until minimax is achieved.
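
The stopping condition described above (four worst-fit points with errors equal in absolute value and alternating in sign) can be checked with a short Python sketch. It is not the SQUISH routine: the function names, the tolerance, and the way worst-fit points are picked (the largest error in each run of same-signed errors) are assumptions made for this illustration, and the check works on any list of predicted values because the functional form of the fitting function is not legible in the source.

```python
def alternating_extrema(errors):
    """Return the largest-magnitude error from each run of same-signed errors."""
    extrema = []
    for e in errors:
        if e == 0:
            continue  # a zero error neither starts nor ends a run
        if extrema and (extrema[-1] > 0) == (e > 0):
            if abs(e) > abs(extrema[-1]):
                extrema[-1] = e
        else:
            extrema.append(e)
    return extrema


def looks_minimax(observed, predicted, n_points=4, tol=1e-3):
    """True if at least n_points worst-fit errors alternate in sign.

    An error counts as 'worst-fit' when its magnitude is within tol of the
    largest error magnitude.
    """
    errors = [y - q for y, q in zip(observed, predicted)]
    extrema = alternating_extrema(errors)
    if not extrema:
        return True  # every point is fit exactly
    peak = max(abs(e) for e in extrema)
    worst = [e for e in extrema if peak - abs(e) <= tol]
    alternations = 1
    for prev, cur in zip(worst, worst[1:]):
        if (prev > 0) != (cur > 0):
            alternations += 1
    return alternations >= n_points
```

A refinement loop of the kind described would adjust the three parameters and repeat this check until it succeeds.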


The typical error at minimax so far observed is on the order of ±0.04.


Another curve fitting routine, QDFIT, is designed to fit a parabola to
computed values of the negative log-likelihood ratio

[Equation illegible in the source scan; only the symbol Q' survives.]

The values of $\alpha$, $\beta$, and $\gamma$ (one set each for the real and the artifact
data) are provided by SQUISH. The procedure used in QDFIT is a simple least
squares fit of a parabola to the computed log-likelihood ratios.
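
A least squares parabola fit of the kind attributed to QDFIT can be sketched in Python as below. This is an assumed, generic formulation (normal equations solved by Gauss-Jordan elimination), not the QDFIT code itself; the function name and the sample data are hypothetical.

```python
def fit_parabola(xs, ys):
    """Least-squares coefficients (c0, c1, c2) of y ~ c0 + c1*x + c2*x**2.

    Requires at least three distinct x values; otherwise the normal
    equations are singular and a ZeroDivisionError is raised.
    """
    # Power sums that make up the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix: row i is [s[i], s[i+1], s[i+2] | t[i]].
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Illustrative data lying exactly on y = 1 + 2x + 3x^2.
sample_x = [0.0, 1.0, 2.0, 3.0, 4.0]
sample_y = [1.0 + 2.0 * x + 3.0 * x * x for x in sample_x]
coeffs = fit_parabola(sample_x, sample_y)
```

In the setting described, the x values would be quality function values and the y values the computed negative log-likelihood ratios at those points.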